
Load a Dataset and Finetune a Model

Fine-tuning a text classifier on your Dataset

Memri data apps often contain machine learning models that can be trained on labeled data from the Memri app. In this guide, we load a labeled dataset from the Pod and use it to fine-tune a RoBERTa text classifier; to keep training fast, we use a smaller variant of this model, distilRoBERTa. The notebook is meant to be run on your own data: after labeling a dataset in the Memri app, fill in the corresponding dataset name and Pod keys in the steps below. The output of this tutorial is a model that can subsequently be used in your data app. As preparation for this tutorial, we uploaded a subset of a benchmark dataset to a Memri Pod, which can be reproduced using this notebook. The code for this tutorial is available as a notebook and on Google Colab.

from IPython.display import clear_output
!pip install pandas transformers torch git+https://gitlab.memri.io/memri/pymemri.git@dev
clear_output()
print("Installed")
Installed

We start by importing the libraries we need for training our model.

import os
import random
import textwrap

import pandas as pd
import torch
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.utils import logging

from pymemri.data.itembase import Edge, Item
from pymemri.data.schema import Dataset, Message, CategoricalLabel
from pymemri.data.loader import write_model_to_package_registry
from pymemri.pod.client import PodClient
from getpass import getpass
transformers.utils.logging.set_verbosity_error()
os.environ["WANDB_DISABLED"] = "true"

Loading your dataset

If you already labeled your dataset in the Memri app, using it here is easy: just fill in your pod_url, dataset_name, login_key and password_key below. For the example in this guide, we use the Tweet eval emoji dataset, which has been uploaded to the Pod in the same structure. This dataset is freely available through the 🤗 Hugging Face datasets library.

### *Define your pod url here*, this is the one for dev.app.memri.io ####
pod_url = "https://dev.pod.memri.io"
### *Define your dataset here* ####
dataset_name = input("dataset_name:") if "dataset_name" not in locals() else dataset_name
### *Define your login key here* ####
login_key = getpass("login Key:") if "login_key" not in locals() else login_key
### *Define your password key here* ####
password_key = getpass("password_key:") if "password_key" not in locals() else password_key

With that set up, we can connect to our Pod to get our dataset.

# Connect to pod
client = PodClient(
    url=pod_url,
    owner_key=login_key,
    database_key=password_key,
)
client.add_to_schema(CategoricalLabel, Message, Dataset);

Next, we download and inspect the tweet-eval-emoji dataset. All entries in the dataset can be found through the Dataset.entry edge.

dataset = client.get_dataset(dataset_name)

print("The first 3 items in the dataset:")
for item in dataset.entry[:3]:
    data = client.get(item.id).data[0]
    content = textwrap.shorten(data.content if data.content else "", width=40)
    print(item, "data.content:", content)
The first 3 items in the dataset:
DatasetEntry (#d61229fe-8155-4c10-aa77-61bca347b538) data.content: Had je er nog veel last van vorige week?
DatasetEntry (#f98745ef-cc21-4072-aef2-165f1bf85869) data.content: ja ik was woensdag en donderdag wel ziek
DatasetEntry (#3c9561d5-dd60-435e-bd13-438c766e5ccb) data.content: sinds vrijdag geen klachten meer

The first step in training your model is exporting the dataset to a format we can use in Python. The Dataset class in pymemri can format your dataset to various datatypes with the Dataset.to method. In this notebook, we will use Pandas.

The columns argument of Dataset.to defines which features are used. A column is either a property of the items in the Dataset (for example, the content of a Message), or a property of a connected item (the value of a Label connected to the Message).

data = dataset.to("pandas", columns=["data.content", "annotation.labelValue"])
data.head()

id data.content annotation.labelValue
0 d61229fe-8155-4c10-aa77-61bca347b538 Had je er nog veel last van vorige week? posit
1 f98745ef-cc21-4072-aef2-165f1bf85869 ja ik was woensdag en donderdag wel ziek negative
2 3c9561d5-dd60-435e-bd13-438c766e5ccb sinds vrijdag geen klachten meer posit
3 3b264af5-cfb0-4725-8f8a-e08e46595b20 Weet je hoe laat je er ongeveer bent? Dan maak... negative
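
Before training, it is worth checking how the labels are distributed; a heavily skewed dataset may need more labeled examples or class weighting. A minimal check on the exported DataFrame (using the column names from above) could look like this:

# Inspect the label distribution of the exported dataset
print(data["annotation.labelValue"].value_counts())

# Drop any rows that are missing a text or a label before training
data = data.dropna(subset=["data.content", "annotation.labelValue"])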

Finetuning a Hugging Face model

In this guide, we finetune a Hugging Face model on the tweet_eval emoji task. The transformers library contains all the code to do the training for us; we only need to define a torch Dataset that holds our data and handles tokenization.

# Hyperparameters
model_name = "distilroberta-base"
batch_size = 32
learning_rate = 1e-3

class TransformerDataset(torch.utils.data.Dataset):
    def __init__(self, data: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase):
        self.data = data
        self.label2idx, self.idx2label = self.get_label_map()
        self.num_labels = len(self.label2idx)
        self.tokenizer = tokenizer
        
    def tokenize(self, message, label=None):
        tokenized = self.tokenizer(message, padding="max_length", truncation=True)
        if label:
            tokenized["label"] = self.label2idx[label]
        return tokenized

    def get_label_map(self):
        # Build mappings between label values and integer indices (and back)
        unique_labels = self.data["annotation.labelValue"].unique()
        return {l: i for i, l in enumerate(unique_labels)}, {i: l for i, l in enumerate(unique_labels)}
        
    def __len__(self):
        return len(self.data)
        
    def __getitem__(self, idx):
        # Get the row from self.data, and skip the first column (id).
        return self.tokenize(*self.data.iloc[idx][1:])

tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = TransformerDataset(data, tokenizer)
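
Before training, we can sanity-check a single example. Each item returned by TransformerDataset is the tokenizer output (input_ids and attention_mask, padded to the model's maximum length) together with the integer label index. The snippet below is only a quick inspection and not required for training:

# Inspect one tokenized training example
sample = dataset[0]
print(sample.keys())
print("label index:", sample["label"], "->", dataset.idx2label[sample["label"]])
print("number of tokens:", len(sample["input_ids"]))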

Training

The 馃 Transformers library provides all the code we need for training a RoBERTa model. We will use their Trainer class, which handles all training, monitoring and integration with Weights & Biases for us. The 馃 Transformers documentation has a detailed tutorial on fine-tuning models, which can be found here.

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=dataset.num_labels,
    id2label=dataset.idx2label
)

# To increase training speed, we will freeze all layers except the classifier head.
for param in model.base_model.parameters():
    param.requires_grad = False
training_args = transformers.TrainingArguments(
    "twitter-emoji-trainer",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    logging_steps=1,
    optim="adamw_torch"
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)
logging.set_verbosity(40)
trainer.train()
{'loss': 0.7297, 'learning_rate': 0.0006666666666666666, 'epoch': 1.0}
{'loss': 0.732, 'learning_rate': 0.0003333333333333333, 'epoch': 2.0}
{'loss': 0.7407, 'learning_rate': 0.0, 'epoch': 3.0}
{'train_runtime': 0.6766, 'train_samples_per_second': 17.736, 'train_steps_per_second': 4.434, 'train_loss': 0.7341360052426656, 'epoch': 3.0}
TrainOutput(global_step=3, training_loss=0.7341360052426656, metrics={'train_runtime': 0.6766, 'train_samples_per_second': 17.736, 'train_steps_per_second': 4.434, 'train_loss': 0.7341360052426656, 'epoch': 3.0})
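
As a quick sanity check, we can classify a single message with the fine-tuned model. This is a minimal sketch: the example text is just one of the messages from the dataset above, and we reuse the tokenizer and the label mapping from our TransformerDataset:

# Classify a single message with the fine-tuned model
model.eval()
example = "sinds vrijdag geen klachten meer"
inputs = tokenizer(example, return_tensors="pt", truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits
predicted_idx = logits.argmax(dim=-1).item()
print("predicted label:", dataset.idx2label[predicted_idx])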

Uploading your model

We trained our model. The last step is to upload it to the package registry of our repository on gitlab.memri.io, so we can use it in a plugin. To do this, you need to create a personal project on gitlab.memri.io for your plugin; for this example, we created sentiment-plugin to upload our model to. When running the command below, you will be asked to provide a personal_access_token (just follow the link), which will be used to upload the model to GitLab on your behalf.

write_model_to_package_registry(model, project_name="sentiment-plugin")
        The first time you are uploading a model you need to create an access_token
        at https://gitlab.memri.io/-/profile/personal_access_tokens?name=Model+Access+token&scopes=api
        Click at the blue button with 'Create personal access token'

Then copy your personal access token from 'Your new personal access token', and paste here: ··········
writing config.json to package registry of sentiment-plugin with project id 175
uploading /tmp/config.json
100.00% [100/100 00:00<00:00]
Succesfully uploaded /tmp/config.json
writing pytorch_model.bin to package registry of sentiment-plugin with project id 175
uploading /tmp/pytorch_model.bin
100.00% [100/100 00:02<00:00]
Succesfully uploaded /tmp/pytorch_model.bin

That’s it, we created our model and made it accessible via the package registry. Check out the next tutorial to see how you can use this model from within your data app.