Load a Dataset and Finetune a Model
Fine-tuning a text classifier on your Dataset
Memri data apps often contain machine learning models, which can be trained on the data you labeled in the Memri app. In this guide, we load a labeled dataset from the Pod and use it to fine-tune a RoBERTa text classifier. The notebook is meant to be run on your own dataset once you have labeled it: fill in the corresponding dataset name and pod keys in the steps below. The output of this tutorial is a model that can subsequently be used in your data app. To prepare this tutorial, we uploaded a subset of a benchmark dataset to a Memri Pod, which can be reproduced using this notebook. To speed up training, we use a smaller, distilled version of the model: DistilRoBERTa. The code for this tutorial is available as a notebook and on Google Colab.
from IPython.display import clear_output
!pip install pandas transformers torch git+https://gitlab.memri.io/memri/pymemri.git@dev
clear_output()
print("Installed")
Installed
We start by importing the libraries we need for training our model.
import os
import random
import textwrap
import pandas as pd
import torch
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.utils import logging
from pymemri.data.itembase import Edge, Item
from pymemri.data.schema import Dataset, Message, CategoricalLabel
from pymemri.data.loader import write_model_to_package_registry
from pymemri.pod.client import PodClient
from getpass import getpass
transformers.utils.logging.set_verbosity_error()
os.environ["WANDB_DISABLED"] = "true"
Loading your dataset
If you have already labeled your dataset in the Memri app, using it is easy: just fill in your pod_url, dataset_name, login_key and password_key below. For our example in this guide, we will use the Tweet eval emoji dataset, which has been uploaded in the same structure to the pod. This dataset is freely available through the 🤗 Hugging Face datasets library.
### *Define your pod url here*, this is the one for dev.app.memri.io ####
pod_url = "https://dev.pod.memri.io"
### *Define your dataset here* ####
dataset_name = input("dataset_name:") if "dataset_name" not in locals() else dataset_name
### *Define your login key here* ####
login_key = getpass("login Key:") if "login_key" not in locals() else login_key
### *Define your password key here* ####
password_key = getpass("password_key:") if "password_key" not in locals() else password_key
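When the notebook runs unattended, interactive prompts are inconvenient. A minimal sketch of one alternative, assuming the values are exported as environment variables beforehand. The helper `read_setting` and the variable name `MEMRI_DATASET_NAME` are hypothetical and not part of pymemri:

```python
import os
from getpass import getpass


def read_setting(env_var, prompt, secret=False, environ=os.environ):
    """Return the value stored under `env_var`, prompting only if it is unset or empty."""
    value = environ.get(env_var)
    if value:
        return value
    # Fall back to the same interactive prompts the notebook uses.
    return getpass(prompt) if secret else input(prompt)


# A plain dict stands in for the real environment in this demonstration.
# The variable name is hypothetical, chosen for this example only.
demo_env = {"MEMRI_DATASET_NAME": "my-dataset"}
dataset_name_from_env = read_setting("MEMRI_DATASET_NAME", "dataset_name:", environ=demo_env)
print(dataset_name_from_env)  # my-dataset
```

Keys should be read with `secret=True`, so that the fallback prompt does not echo them to the screen.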
With that set up, we can connect to our pod to get our dataset.
# Connect to pod
client = PodClient(
    url=pod_url,
    owner_key=login_key,
    database_key=password_key,
)
client.add_to_schema(CategoricalLabel, Message, Dataset);
Next, we download and inspect the tweet-eval-emoji dataset. All entries in the dataset can be found through the Dataset.entry edge.
dataset = client.get_dataset(dataset_name)
print("The first 3 items in the dataset:")
for item in dataset.entry[:3]:
    data = client.get(item.id).data[0]
    content = textwrap.shorten(data.content if data.content else "", width=40)
    print(item, "data.content:", content)
The first 3 items in the dataset:
DatasetEntry (#d61229fe-8155-4c10-aa77-61bca347b538) data.content: Had je er nog veel last van vorige week?
DatasetEntry (#f98745ef-cc21-4072-aef2-165f1bf85869) data.content: ja ik was woensdag en donderdag wel ziek
DatasetEntry (#3c9561d5-dd60-435e-bd13-438c766e5ccb) data.content: sinds vrijdag geen klachten meer
The first step in training your model is exporting the dataset to a format we can use in Python. The Dataset class in pymemri can format your dataset to various datatypes with the Dataset.to method. In this notebook, we will use Pandas.
The columns argument of Dataset.to defines which features are used. A column is either a property of the items in the Dataset (for example, the content of a Message), or a property of a connected item (the value of a Label connected to the Message).
data = dataset.to("pandas", columns=["data.content", "annotation.labelValue"])
data.head()
| | id | data.content | annotation.labelValue |
|---|---|---|---|
| 0 | d61229fe-8155-4c10-aa77-61bca347b538 | Had je er nog veel last van vorige week? | posit |
| 1 | f98745ef-cc21-4072-aef2-165f1bf85869 | ja ik was woensdag en donderdag wel ziek | negative |
| 2 | 3c9561d5-dd60-435e-bd13-438c766e5ccb | sinds vrijdag geen klachten meer | posit |
| 3 | 3b264af5-cfb0-4725-8f8a-e08e46595b20 | Weet je hoe laat je er ongeveer bent? Dan maak... | negative |
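To make the dotted column names concrete, the sketch below resolves paths such as `data.content` on plain nested dictionaries. It only illustrates the idea of following an edge and then reading a property. The `resolve` helper, the dictionary layout and the sample entries are assumptions made for this example, and this is not how pymemri implements `Dataset.to`:

```python
def resolve(item, path):
    """Follow a dotted path; an edge is modelled as a list of connected items, and we take the first."""
    value = item
    for name in path.split("."):
        if isinstance(value, list):
            # Step across an edge to its first target, if there is one.
            value = value[0] if value else None
        if value is None:
            return None
        value = value.get(name)
    return value


# Made-up entries: each has a `data` edge to a message and an `annotation` edge to a label.
entries = [
    {"id": "e1", "data": [{"content": "hello"}], "annotation": [{"labelValue": "positive"}]},
    {"id": "e2", "data": [{"content": "bye"}], "annotation": []},  # not labeled yet
]
columns = ["data.content", "annotation.labelValue"]
rows = [{"id": e["id"], **{c: resolve(e, c) for c in columns}} for e in entries]
print(rows)
```

Each resulting row has one key per requested column, and an entry without a connected label yields `None` in that column.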
Finetuning a Hugging Face model
In this guide, we fine-tune a Hugging Face model on the tweet_eval emoji task. The transformers library contains all the code needed for training; we only need to define a torch Dataset that contains our data and handles tokenization.
# Hyperparameters
model_name = "distilroberta-base"
batch_size = 32
learning_rate = 1e-3
class TransformerDataset(torch.utils.data.Dataset):
    def __init__(self, data: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase):
        self.data = data
        self.label2idx, self.idx2label = self.get_label_map()
        self.num_labels = len(self.label2idx)
        self.tokenizer = tokenizer

    def tokenize(self, message, label=None):
        # Pad or truncate every message to the model's maximum input length.
        tokenized = self.tokenizer(message, padding="max_length", truncation=True)
        if label is not None:
            tokenized["label"] = self.label2idx[label]
        return tokenized

    def get_label_map(self):
        # Read from self.data (not the global `data`) so the class works with any DataFrame.
        unique_labels = self.data["annotation.labelValue"].unique()
        return {l: i for i, l in enumerate(unique_labels)}, {i: l for i, l in enumerate(unique_labels)}

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Get the row from self.data, and skip the first column (id).
        return self.tokenize(*self.data.iloc[idx][1:])
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = TransformerDataset(data, tokenizer)
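The label map built in `get_label_map` can be illustrated without pandas. The sketch below assigns indices in order of first appearance, which is also the order in which `Series.unique()` returns values. The function name `build_label_maps` and the sample labels are made up for this example:

```python
def build_label_maps(labels):
    """Map each distinct label to an integer index, in order of first appearance."""
    unique_labels = list(dict.fromkeys(labels))  # de-duplicates while keeping order
    label2idx = {label: idx for idx, label in enumerate(unique_labels)}
    idx2label = {idx: label for label, idx in label2idx.items()}
    return label2idx, idx2label


labels = ["positive", "negative", "positive", "neutral"]
label2idx, idx2label = build_label_maps(labels)
print(label2idx)  # {'positive': 0, 'negative': 1, 'neutral': 2}
print(idx2label)  # {0: 'positive', 1: 'negative', 2: 'neutral'}
```

The classifier needs one output per entry in `label2idx`, and `idx2label` is what lets a predicted index be decoded back into a label name.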
The 🤗 Transformers library provides all the code we need for training a RoBERTa model. We will use its Trainer class, which handles training, monitoring and integration with Weights & Biases for us (the latter is disabled in this notebook through the WANDB_DISABLED environment variable set earlier). The 🤗 Transformers documentation has a detailed tutorial on fine-tuning models, which can be found here.
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=dataset.num_labels,
    id2label=dataset.idx2label
)
# To increase training speed, we will freeze all layers except the classifier head.
for param in model.base_model.parameters():
    param.requires_grad = False
training_args = transformers.TrainingArguments(
    "twitter-emoji-trainer",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    logging_steps=1,
    optim="adamw_torch"
)
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)
logging.set_verbosity(40)
trainer.train()
{'loss': 0.7297, 'learning_rate': 0.0006666666666666666, 'epoch': 1.0}
{'loss': 0.732, 'learning_rate': 0.0003333333333333333, 'epoch': 2.0}
{'loss': 0.7407, 'learning_rate': 0.0, 'epoch': 3.0}
{'train_runtime': 0.6766, 'train_samples_per_second': 17.736, 'train_steps_per_second': 4.434, 'train_loss': 0.7341360052426656, 'epoch': 3.0}
TrainOutput(global_step=3, training_loss=0.7341360052426656, metrics={'train_runtime': 0.6766, 'train_samples_per_second': 17.736, 'train_steps_per_second': 4.434, 'train_loss': 0.7341360052426656, 'epoch': 3.0})
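The learning_rate values in the log are consistent with the Trainer defaults: three epochs and a linear decay of the learning rate to zero without warmup. A small sketch of that arithmetic, assuming those defaults, a single device, no gradient accumulation, and a dataset of four examples that fits in one batch of 32. The function names are made up for illustration and are not part of the transformers API:

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when every epoch runs ceil(n / batch_size) batches."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_decay_lr(base_lr, step, total_steps):
    """Learning rate after `step` optimizer steps, decaying linearly to zero without warmup."""
    return base_lr * max(0, total_steps - step) / total_steps


total = total_training_steps(num_examples=4, batch_size=32, num_epochs=3)
schedule = [linear_decay_lr(1e-3, step, total) for step in range(1, total + 1)]
print(total)
print(schedule)
```

Under these assumptions the run has three steps, and the rate falls to two thirds and then one third of 1e-3 before reaching 0.0, in line with global_step=3 and the logged values.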
Uploading your model
We have trained our model. The last step is to upload it to the package registry of our repo at gitlab.memri.io, so we can use it in a plugin. To do this, you need to create a personal project on gitlab.memri.io for your plugin. For this example, we created sentiment-plugin to upload our model to. When you run the command below, you will be asked to provide a personal access token (just follow the link), which is used to upload the model to GitLab on your behalf.
write_model_to_package_registry(model, project_name="sentiment-plugin")
The first time you are uploading a model you need to create an access_token
at https://gitlab.memri.io/-/profile/personal_access_tokens?name=Model+Access+token&scopes=api
Click at the blue button with 'Create personal access token'"
Then copy your personal access token from 'Your new personal access token', and paste here: ··········
writing config.json to package registry of sentiment-plugin with project id 175
uploading /tmp/config.json
Succesfully uploaded /tmp/config.json
writing pytorch_model.bin to package registry of sentiment-plugin with project id 175
uploading /tmp/pytorch_model.bin
Succesfully uploaded /tmp/pytorch_model.bin
That's it: we created our model and made it accessible via the package registry. Check out the next tutorial to see how you can use this model from within your data app.