Datasets

A dataset is a central item in the Pod that organizes your project data and label annotations. To make Dataset items easier to use in a data science workflow, the Dataset class provides methods to convert the data to a common data science format or to save a dataset to disk.

class Dataset

Dataset(**kwargs) :: Dataset

The main Dataset class

Dataset.to

Dataset.to(dtype:str, columns:List[str], filter_missing:bool=True)

Converts Dataset to a different format.
Available formats:
list: a 2-dimensional list, containing one dataset entry per row
dict: a list of dicts, where each dict contains {column: value} for each column
pd: a Pandas dataframe

Args:
    dtype (str): Datatype of the returned dataset
    columns (List[str]): Column names of the dataset
    filter_missing (bool, optional): If true, all rows that contain None values are omitted. Defaults to True.

Returns:
    Any: Dataset formatted according to dtype
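To make the output shapes concrete, the following is a minimal sketch in plain Python of what the list and dict formats described above look like, and of how filter_missing=True drops rows that contain None. It does not call the Pod or the Dataset class: COLUMNS, ROWS, drop_missing, as_list and as_dicts are hypothetical names used only for illustration, and the real output may contain additional columns (such as an id).

```python
from typing import Any, Dict, List, Optional

# Hypothetical data: one row per dataset entry, None marks a missing value.
COLUMNS: List[str] = ["data.content", "annotation.name"]
ROWS: List[List[Optional[str]]] = [
    ["content_0", "label_0"],
    ["content_1", None],  # annotation is missing for this entry
    ["content_2", "label_2"],
]


def drop_missing(rows: List[List[Any]]) -> List[List[Any]]:
    """Keep only rows without None values (the documented filter_missing=True behavior)."""
    return [row for row in rows if all(value is not None for value in row)]


def as_list(rows: List[List[Any]], filter_missing: bool = True) -> List[List[Any]]:
    """'list' shape: a 2-dimensional list with one dataset entry per row."""
    return drop_missing(rows) if filter_missing else [list(row) for row in rows]


def as_dicts(
    rows: List[List[Any]], columns: List[str], filter_missing: bool = True
) -> List[Dict[str, Any]]:
    """'dict' shape: a list of dicts, each mapping column name to value."""
    return [dict(zip(columns, row)) for row in as_list(rows, filter_missing)]


if __name__ == "__main__":
    print(as_list(ROWS))
    print(as_dicts(ROWS, COLUMNS, filter_missing=False))
```

With filter_missing=False, the row with the missing annotation is kept and its value stays None, so downstream code has to handle missing values itself.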

Dataset.save

Dataset.save(path:Union[Path, str], columns:List[str], filter_missing:bool=True)

Save dataset to CSV.
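As a rough illustration of what saving to CSV involves, the sketch below writes a header row of column names followed by one line per entry, using only the Python standard library. It is not the implementation of Dataset.save: the helper save_rows_to_csv and its rows argument are hypothetical, and the assumption that filter_missing behaves the same way as in Dataset.to is inferred from the shared signature.

```python
import csv
from pathlib import Path
from typing import Any, List, Union


def save_rows_to_csv(
    path: Union[Path, str],
    columns: List[str],
    rows: List[List[Any]],
    filter_missing: bool = True,
) -> int:
    """Write a header plus one line per row; return the number of rows written."""
    if filter_missing:
        # Assumption: rows containing None are skipped, as in Dataset.to.
        rows = [row for row in rows if all(value is not None for value in row)]
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)  # header: the requested column names
        writer.writerows(rows)    # body: one dataset entry per line
    return len(rows)
```

Opening the file with newline="" follows the csv module's recommendation and avoids blank lines between rows on Windows.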

Usage

To convert the data in the Pod to a different format, use the Dataset.to method. The columns argument defines which features are included in the converted dataset. A column is either a property of an entry in the dataset, or a property of an item connected to an entry in the dataset.

The Pod uses the following schema for Dataset items. Note that the DatasetEntry item is always included, and the actual data can be found by traversing the entry.data Edge.

[Figure: dataset schema]

For example, if a dataset is a set of Message items and the message content should be included as a column, the column name is data.content. To include the name of the sender of each message, data.sender.handle is a valid column name.
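One way to picture how a dotted column name maps onto an entry is to follow each name in the path as an attribute, starting from the entry itself. The sketch below does this with stand-in objects rather than real Pod items. resolve_column and the SimpleNamespace objects are hypothetical, and the sketch assumes every step resolves to a single object, which may not hold for edges that point to several items.

```python
from types import SimpleNamespace
from typing import Any


def resolve_column(entry: Any, column: str) -> Any:
    """Follow a dotted path such as 'data.sender.handle' from an entry.

    Returns None as soon as any step along the path is missing.
    """
    value = entry
    for name in column.split("."):
        if value is None:
            return None
        value = getattr(value, name, None)
    return value


# Stand-ins for a dataset entry whose data edge points to a Message
# with a sender; this entry has no annotation.
sender = SimpleNamespace(handle="account_0")
message = SimpleNamespace(content="content_0", sender=sender)
entry = SimpleNamespace(data=message, annotation=None)

if __name__ == "__main__":
    for column in ["data.content", "data.sender.handle", "annotation.name"]:
        print(column, "->", resolve_column(entry, column))
```

A path that cannot be resolved yields None here, which is the kind of value that filter_missing=True would cause to be dropped from the output.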

The following example retrieves an example dataset of Message items and converts it to a pandas DataFrame:

client = PodClient()
client.add_to_schema(Dataset, DatasetEntry)
True
dataset = client.get_dataset("example-dataset")
columns = ["data.content", "data.sender.handle", "annotation.name"]
dataframe = dataset.to("pd", columns=columns)
dataframe.head()
                                 id data.content data.sender.handle annotation.name
0  371dbdda6d854434b256e4d826cbeec0    content_0          account_0         label_0
1  989a77c9bd7c4e7ba3a0d78656c77b06    content_1          account_1         label_1
2  8af555d92cf8406a9e7ad99d9c168360    content_2          account_2         label_2
3  67318a855d844e7da69beebd408871fd    content_3          account_3         label_3
4  e7fd9c347dc941c994eb5a56a63354e0    content_4          account_4         label_4