PGPYML

Python is a popular programming language to develop machine learning solutions. You may use it to do some exploratory data analysis and write a model to do predictions over new data. When you are satisfied with the results produced by your model you will deploy it, but this task can be time-consuming. With pgpyml you can easily use your model as an in-database solution and take the advantages of this approach.

Pgpyml is an extension that allows you to use your models written in Python inside PostgreSQL. You can make predictions directly on the data being inserted at your base, you can also create triggers to avoid insertions based on the predictions.

This extension is under development, designed to be compatible with sklearn and is not ready to be used in the production environment.

Instalation

This section will instruct you how to install the pgpyml extension. You must follow the Prerequisites section and choose the pgxnclient instructions or the git installation. Once the installation is done you can create the extension on your database.

Alternatively you can use the Vagrant machine to test it.

Prerequisites

First of all, you will need to install the python libraries that are used to write and run your machine learning models:

pip3 install numpy scikit-learn pandas

This command install the numpy, sklearn and pandas libraries. The pgpyml extension expects that you use sklearn to build your models.

You will need to install the Python extension for PostgreSQL. You can do this with the command:

# Replace the <version> with your PostgreSQL version
apt -y install postgresql-plpython3-<version>

Replace the <version> in the command with the Postgres version that you are using, for example, if you want to install the Python extension to Postgres 14 use:

apt -y install postgresql-plpython3-14

Install with PGNXClient

You can install the extension using the pgnxclient, you will need to install it if you don't already have it:

apt install pgxnclient

And then you can install the pgpyml extension with:

pgxn install pgpyml

Install with Git

You can install the pgpyml with git, cloning the repository and running the instalation script. To do this, first clone the extension's repository:

git clone -b v0.2.0 https://github.com/Minoro/pgpyml.git

Change the v0.2.0 to the desired version. And inside the downloaded folder run the make file:

make install 
make clean

Create the Extension

Once you have installed the extension you can create it on your PostgreSQL:

-- Python Language extension
CREATE EXTENSION plpython3u;

-- This extension
CREATE EXTENSION pgpyml;

Using the Vagrant Machine

The pgpyml repository has a Vagrantfile that you can use to test this extension, the repository also include an example of how you can train and save a model. The example use the UCI Iris Dataset, with are also included in the repository.

To use the vagrant machine you can navigate to the vagrant folder and run:

vagrant up    # Initiate the machine
vagrant ssh   # Acess the machine

After that you will be able to connect to Postgresql on host http://localhost:5555 through your host machine.

Although it is not necessary to use the vagrant file from the extension repository, the examples will be displayed using the paths of the virtual machine.

Train Your Model

You can train your model as you are used to. The Sklearn documentation has many examples that you can use, in this section we will build a minimal example using Python and Sklearn and deploy it on Python.

We will use the Iris Dataset as an example, we will build a model to predict the class of new instances and use it on PostgresSQL.

Let's create a file named train.py with the code:

# Script: train.py
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
from joblib import dump, load

iris = datasets.load_iris()
X = iris.data
y = iris.target

model = DecisionTreeClassifier()

model.fit(X, y)
dump(model, '/home/vagrant/models/iris_decision_tree.joblib')

After running this code you will see an file named iris_decision_tree.joblib in the same folder that you runned the training script. This is the model that we will deploy on PostgreSQL. You can read more about persisting your models on Sklearn Documentation.

Deploy on PostgresSQL

Now you have everything you need to deploy your Machine Learning model on PostgresSQL. Run the following command to make a prediction on new data:

SELECT * FROM predict_table_row(
    '/home/vagrant/models/iris_decision_tree.joblib', -- The trained model
    'iris', -- Table with the data
    '{"sepal_length", "sepal_width", "petal_length", "petal_width"}', -- The columns used as feature
    1 -- The ID of your data
);

This command will load your model and use it to predict a single row in your database. The first parameters is the model you want to use, the second one is the table in your database where your data is stored, se third parameter is a array with the name of the columns that hold the data to use with your model, the final parameter is the ID of the data you want to predict.