Build a Machine Learning Model for your data app

Introduction

This guide outlines the full process of building a state-of-the-art sentiment classifier plugin. We will tackle the following steps:

  1. Creating a plugin, the backend of your data app, from a template using pymemri
  2. Loading a pretrained Transformer pipeline from 🤗 Hugging Face into your plugin
  3. Testing, configuring, and deploying your plugin, all made very easy using the plugin template
  4. Connecting your plugin to your data app from the Memri app
  5. Optional step: preloading the model

This guide assumes that you have Python installed and know how to manage Python environments. If you don't, please start with a basic Python setup.

Create a plugin from a template using pymemri

The backend of your data app is a Memri plugin, written in Python. Memri has a Python library that helps you build plugins: pymemri.

The pymemri template module offers a way to create a project from a template, with setup, testing, and CI preconfigured. In this guide we use the classifier plugin template, which lets us write an app for Memri in just a few lines of code!

First things first, let's install pymemri:

pip install pymemri

On the Memri Gitlab, create a blank public repository for your plugin and clone it. Note that this is a self-hosted gitlab instance, so you cannot use your account from gitlab.com. From within the new repo, run the pymemri plugin_from_template CLI and install the plugin:

plugin_from_template --template="classifier_plugin" --description="A transformer based sentiment analysis plugin" \
                     --install_requires=transformers,sentencepiece,protobuf,torch==1.10.0
pip install -e .

The plugin_from_template call creates the following folder structure:

├── setup.cfg                           
├── setup.py                            
├── Dockerfile                          <- Dockerfile for your plugin, which builds the plugin docker image
├── metadata.json                       <- Metadata for your plugin, the Pymemri frontend uses this during installation
├── .gitignore                          
├── .gitlab-ci.yml                      <- CI/CD for your plugin, which 1) installs your plugin and pod 2) runs tests 3) deploys your plugin
├── sentiment_plugin                    <- Source code for your plugin
│   ├── model.py                        <- Model definition of your classifier
│   ├── plugin.py                       <- Plugin class
│   ├── schema.py                       <- The schema definition for your plugin
│   └── utils.py                        <- Utility functions for plugins: converting items, converting photos, etc.
├── tests                               <- Tests for your plugin 
│   └── test_plugin.py
├── tools                               
│    └── preload.py                     <- You can define logic here that downloads models and assets required to run your plugin

The resulting plugin in sentiment_plugin/plugin.py is the entrypoint of your project. The Plugin.run method is called when the pod runs your plugin. In most cases, you do not have to edit this file as everything is set up correctly by the template. The model used by this plugin is defined in model.py, which we will edit in the next step to define our sentiment classifier.

Load a pretrained Transformer pipeline from 🤗 Hugging Face

We are building a sentiment analysis plugin for users from different countries, who own data in different languages. In this guide we are not training the model from scratch, because Hugging Face Models has many suitable pretrained models for this task. A quick search yields the following RoBERTa model (twitter-xlm-roberta-base-sentiment). For more information about this model, read this blogpost (or the original paper). When we called plugin_from_template, we already added the libraries this model needs as requirements. The model card contains all the code we need to build a functioning sentiment analysis plugin. We slightly modify the standard example to deal with messages that are longer than the default max length. We provide two options for your model; both can be implemented by inserting code into the model template in sentiment_plugin/model.py:

  1. Use your own model, like the one trained in the previous tutorial (see example). This is what the code below does.
  2. Use a pretrained model: replace the model and tokenizer lines in the code below with model = AutoModelForSequenceClassification.from_pretrained(self.name); tokenizer = AutoTokenizer.from_pretrained(self.name, model_max_length=512)

from typing import List, Any

# Add these lines
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from pymemri.data.loader import load_huggingface_model_for_project

class Model:
    def __init__(self, name: str = "cardiffnlp/twitter-xlm-roberta-base-sentiment", version: str = None):
        self.name = name
        self.version = version

        # Add these lines
        model = load_huggingface_model_for_project(project_path="koenvanderveen/sentiment-plugin")
        tokenizer = AutoTokenizer.from_pretrained("distilroberta-base", model_max_length=512)

        self.pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, return_all_scores=True, truncation=True)

    def predict(self, x: List[str]) -> List[dict]:
        # Add this line
        return self.pipeline(x)

This plugin template assumes a specific output format for Model.predict, which is documented in sentiment_plugin/model.py. If you want to use a different model, make sure that the output format is the same.
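
To make the shape of such output concrete, here is a minimal sketch of how per-class scores, as returned by a Hugging Face text-classification pipeline with return_all_scores=True, can be reduced to one label per input. The helper name top_labels and the sample scores are illustrative assumptions and not part of the template; the authoritative format remains the one documented in sentiment_plugin/model.py.

```python
from typing import Dict, List, Union

# One entry per class, e.g. {"label": "Positive", "score": 0.91}.
ClassScore = Dict[str, Union[str, float]]


def top_labels(predictions: List[List[ClassScore]]) -> List[str]:
    """Return the highest-scoring label for each input text.

    Hypothetical helper: `predictions` is assumed to hold, for every input,
    a list with one {"label", "score"} dict per class.
    """
    return [max(scores, key=lambda s: s["score"])["label"] for scores in predictions]


# Made-up scores for two inputs, shaped like the pipeline output.
example = [
    [
        {"label": "Negative", "score": 0.02},
        {"label": "Neutral", "score": 0.07},
        {"label": "Positive", "score": 0.91},
    ],
    [
        {"label": "Negative", "score": 0.88},
        {"label": "Neutral", "score": 0.09},
        {"label": "Positive", "score": 0.03},
    ],
]
print(top_labels(example))
```

If you swap in a different model, a check like this on a few sample inputs is a quick way to see whether its output still has the same structure.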

Test, configure, and deploy your plugin

To test your plugin, it needs to 1) get data from the pod, 2) make predictions on that data, and 3) write data back to the pod. Step 3 is handled by the template, and we just implemented step 2. We will now manually define step 1. In tests/test_plugin.py, a simple pytest setup is defined. To implement step 1, we write the create_dummy_data method in tests/test_plugin.py, which adds data for our tests to the Pod and returns a query that retrieves that data. The tests then run the plugin on this data and verify the output.

def create_dummy_data(client: PodClient) -> dict:
    # Add multilingual Message items to the Pod
    client.add_to_schema(Message)
    client.bulk_action(
        create_items = [Message(content=sample, service="sentiment_test") 
        for sample in ["this is great", "this is awful", "c'est incroyable", "c'est horrible"]]
    )
    
    # Return query parameters to retrieve dummy data
    return {"type": "Message", "service": "sentiment_test"}
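
The three steps described above (get data, predict, write back) can be pictured with a small self-contained sketch. Everything in it is a stand-in chosen for illustration: FakePod is an in-memory substitute and not the pymemri PodClient, fake_predict is a toy keyword rule and not the transformer model, and the "Label" item type is hypothetical.

```python
from typing import Dict, List


class FakePod:
    """In-memory stand-in for a pod; NOT the pymemri PodClient."""

    def __init__(self) -> None:
        self.items: List[Dict[str, str]] = []

    def add(self, item: Dict[str, str]) -> None:
        self.items.append(item)

    def search(self, query: Dict[str, str]) -> List[Dict[str, str]]:
        # Keep items whose fields match every key/value pair in the query.
        return [i for i in self.items if all(i.get(k) == v for k, v in query.items())]


def fake_predict(texts: List[str]) -> List[str]:
    """Toy keyword rule standing in for Model.predict."""
    negative_words = ("awful", "horrible")
    return ["negative" if any(w in t for w in negative_words) else "positive" for t in texts]


def run_plugin(pod: FakePod, query: Dict[str, str]) -> None:
    # 1) get data from the pod
    messages = pod.search(query)
    # 2) make predictions on that data
    labels = fake_predict([m["content"] for m in messages])
    # 3) write the results back to the pod
    for message, label in zip(messages, labels):
        pod.add({"type": "Label", "value": label, "source": message["content"]})


pod = FakePod()
for sample in ["this is great", "this is awful", "c'est incroyable", "c'est horrible"]:
    pod.add({"type": "Message", "service": "sentiment_test", "content": sample})

run_plugin(pod, {"type": "Message", "service": "sentiment_test"})
results = pod.search({"type": "Label"})
print([(item["source"], item["value"]) for item in results])
```

The query dictionary plays the same role as the one returned by create_dummy_data: it selects only the items that the test itself created.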

Next, start a local pod. You can either check the pod readme, or close your eyes and run this one-liner (assuming you have docker installed):

docker run --rm --init -p 3030:3030 --name pod -v /var/run/docker.sock:/var/run/docker.sock --entrypoint /pod gitlab.memri.io:5050/memri/pod:dev-latest --owners=ANY --insecure-non-tls=0.0.0.0 --plugins-callback-address=http://pod:3030
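
Before running the tests, it can help to confirm that something is listening on the port published by the command above. The following is a minimal sketch using only the Python standard library. The function name pod_is_listening is hypothetical, and it assumes the pod is reachable on localhost port 3030, as in the docker command. It only checks that a TCP connection can be opened, not that the Pod API itself is healthy.

```python
import socket


def pod_is_listening(host: str = "localhost", port: int = 3030, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if the connection is refused or times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Pod reachable:", pod_is_listening())
```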

With your local Pod running, you can now run your tests using:

pytest

Note that the first time you run this, it may take a minute as your tests are downloading the model! The second time will be much faster.

Connect your plugin to your data app from the Memri app: Push your plugin to gitlab

To be able to use your plugin, you have to publish a docker image of your plugin to the gitlab container registry of your repo. As an example, you can look at the registry for the plugin we are creating. You can navigate to the registry of your gitlab repo by going to Packages & Registries -> Container Registry. Using the plugin template, publishing is just a matter of pushing your code to the dev or prod branch of your repo. Let’s try that:

git add -A
git commit -m "publish v0.1"
git push

Now, because of the build_image stage that the template defined in our .gitlab-ci.yml, a gitlab CI pipeline is started. It builds our plugin docker image as described in our Dockerfile and automatically uploads it to the registry of your repo. It will take a few minutes before the pipeline is completed and the image shows up in your container registry. You can see your CI pipeline and its progress in your repo under CI/CD -> Pipelines. For an example, see this.

Connect your plugin to your data app from the Memri app: configuring your plugin

When you install the plugin in the Memri app, the app reads several files (configs, schemas, etc.) and the container generated by the CI from your gitlab repo. Therefore, we need to provide the frontend with the url of our repo. We can simply enter the url of the repo (e.g. https://gitlab.memri.io/eelcovdw/sentiment_plugin) in the frontend.