Datasets
To convert the data in the Pod to a different format, Dataset implements the Dataset.to method. Its columns argument defines which features are included in the converted dataset. A column is either a property of an entry in the dataset, or a property of an item connected to an entry in the dataset.
The Pod uses the following schema for Dataset items. Note that the DatasetEntry item is always included, and the actual data can be found by traversing the entry.data edge.
For example, if a dataset is a set of Message items and the content of each message has to be included as a column, data.content would be the column name. If the name of the sender of a message has to be included, data.sender.handle would be a valid column name.
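The way a dotted column name maps onto a chain of properties can be sketched in plain Python. This is only an illustration of the naming convention, not the library's implementation: resolve_column is a hypothetical helper, and the SimpleNamespace objects are stand-ins for a DatasetEntry and the items connected to it.

```python
from types import SimpleNamespace


def resolve_column(entry, column):
    """Follow a dotted column name from an entry, one property at a time."""
    value = entry
    for part in column.split("."):
        value = getattr(value, part)
    return value


# Hypothetical stand-ins for a DatasetEntry and its connected items.
sender = SimpleNamespace(handle="alice")
message = SimpleNamespace(content="Hello!", sender=sender)
entry = SimpleNamespace(data=message)

print(resolve_column(entry, "data.content"))        # Hello!
print(resolve_column(entry, "data.sender.handle"))  # alice
```

Each part of the column name is one step away from the entry: data reaches the message, and sender reaches the item connected to that message.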
The following example retrieves an example dataset of Message items and formats it as a pandas DataFrame:
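# Import paths below assume the pymemri package layout; adjust if yours differs.
from pymemri.pod.client import PodClient
from pymemri.data.schema import Dataset, DatasetEntry

client = PodClient()
client.add_to_schema(Dataset, DatasetEntry)  # register the item types with the Pod
dataset = client.get_dataset("example-dataset")
# Each column is a property path that starts at the DatasetEntry
columns = ["data.content", "data.sender.handle", "annotation.name"]
dataframe = dataset.to("pd", columns=columns)  # "pd" selects the pandas output format
dataframe.head()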
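To make the shape of the result more concrete, the sketch below assembles a column-oriented table from a few entries without a running Pod. It is a rough approximation under stated assumptions: entries_to_columns and make_entry are hypothetical helpers, the entries are plain stand-in objects, and the row values are made up for illustration.

```python
from operator import attrgetter
from types import SimpleNamespace


def entries_to_columns(entries, columns):
    """Return {column name: list of values}, with one value per entry."""
    # attrgetter accepts dotted names, so "data.sender.handle" is followed
    # property by property starting at each entry.
    getters = {name: attrgetter(name) for name in columns}
    return {name: [get(entry) for entry in entries] for name, get in getters.items()}


def make_entry(content, handle, label):
    # Hypothetical stand-in for a DatasetEntry with its data and annotation.
    return SimpleNamespace(
        data=SimpleNamespace(content=content, sender=SimpleNamespace(handle=handle)),
        annotation=SimpleNamespace(name=label),
    )


# Made-up rows, for illustration only.
entries = [
    make_entry("Hello!", "alice", "greeting"),
    make_entry("See you", "bob", "farewell"),
]
columns = ["data.content", "data.sender.handle", "annotation.name"]
table = entries_to_columns(entries, columns)
print(table["data.sender.handle"])  # ['alice', 'bob']
```

A mapping from column names to equal-length lists is one of the inputs the pandas DataFrame constructor accepts, so a table in this form corresponds to one DataFrame column per requested column name and one row per entry.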