from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")
Found cached dataset glue (/Users/hamel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
These are some notes on the basics of working with HF datasets. These are very important if you want to fine tune LLMs because you will be downloading / uploading datasets from the Hub frequently.
Some takeaways:

- `dataset.map` does some kind of dict merge, so a `dataset.map(...)` that emits a new dict key will add an additional field, e.g. features: ['output', 'instruction', 'input'].
- You can stream a dataset instead of downloading it: `ds = load_dataset("bigcode/the-stack", streaming=True, split="train")`.
- `batched=True` is a good way to speed things up.

Following notes on this page.
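Indexing into the dataset returns a dict; the first row below was presumably produced with something like:

```python
dataset[0]
```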
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0}
You will want to tokenize the examples in this case:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
With just one input, the token_type_ids
are the same:
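The call isn't shown in these notes; judging from the decoded tokens further down, it was presumably something like:

```python
tokenizer("hello what is going on")
```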
{'input_ids': [101, 7592, 2054, 2003, 2183, 2006, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
With two inputs, the token_type_ids
are indexed accordingly:
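Again the call isn't shown; presumably (the sentences are inferred from the decoded output below):

```python
out = tokenizer("hello what is going on?", "i am here.")
out
```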
{'input_ids': [101, 7592, 2054, 2003, 2183, 2006, 1029, 102, 1045, 2572, 2182, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
groups = [[], []]
# split the token ids into segment A / segment B using token_type_ids
for i, tt in zip(out['input_ids'], out['token_type_ids']):
    groups[tt].append(i)

# decode each segment back to text
for g in groups:
    print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(g)))
[CLS] hello what is going on? [SEP]
i am here. [SEP]
The quickstart says that the model requires the field name labels
. How would we know? We can look at the forward
method of the model:
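The help output below was presumably generated with:

```python
help(model.forward)
```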
Help on method forward in module transformers.models.bert.modeling_bert:
forward(input_ids: Optional[torch.Tensor] = None, attention_mask: Optional[torch.Tensor] = None, token_type_ids: Optional[torch.Tensor] = None, position_ids: Optional[torch.Tensor] = None, head_mask: Optional[torch.Tensor] = None, inputs_embeds: Optional[torch.Tensor] = None, labels: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None) -> Union[Tuple[torch.Tensor], transformers.modeling_outputs.SequenceClassifierOutput] method of transformers.models.bert.modeling_bert.BertForSequenceClassification instance
The [`BertForSequenceClassification`] forward method, overrides the `__call__` special method.
<Tip>
Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
</Tip>
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
token_type_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Returns:
[`transformers.modeling_outputs.SequenceClassifierOutput`] or `tuple(torch.FloatTensor)`: A [`transformers.modeling_outputs.SequenceClassifierOutput`] or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([`BertConfig`]) and inputs.
- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Classification (or regression if config.num_labels==1) loss.
- **logits** (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`) -- Classification (or regression if config.num_labels==1) scores (before SoftMax).
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Example of single-label classification:
```python
>>> import torch
>>> from transformers import AutoTokenizer, BertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
>>> model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
'LABEL_1'
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity", num_labels=num_labels)
>>> labels = torch.tensor([1])
>>> loss = model(**inputs, labels=labels).loss
>>> round(loss.item(), 2)
0.01
```
Example of multi-label classification:
```python
>>> import torch
>>> from transformers import AutoTokenizer, BertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
>>> model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity", problem_type="multi_label_classification")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = BertForSequenceClassification.from_pretrained(
... "textattack/bert-base-uncased-yelp-polarity", num_labels=num_labels, problem_type="multi_label_classification"
... )
>>> labels = torch.sum(
... torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss
```
Change `label` to `labels`.
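One way to do this (a sketch, the original cell isn't shown) is with rename_column:

```python
dataset = dataset.rename_column("label", "labels")
```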
There seem to be many subsets. This is the page.
' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike\n License.\n'
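A sketch of how you could list the subsets (configurations) and load one; the config names below are from the wikitext dataset card:

```python
from datasets import get_dataset_config_names, load_dataset

get_dataset_config_names("wikitext")
# ['wikitext-103-raw-v1', 'wikitext-103-v1', 'wikitext-2-raw-v1', 'wikitext-2-v1']

wiki = load_dataset("wikitext", "wikitext-2-v1", split="train")
```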
[{'text': ''},
{'text': ' = Homarus gammarus = \n'},
{'text': ''},
{'text': ' Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . \n'},
{'text': ''}]
You can load a dataset from CSV, TSV, text, JSON, JSONL, or pandas DataFrames. You can also point to a URL:
from datasets import load_dataset
ds = load_dataset("csv", data_files="https://github.com/datablist/sample-csv-files/raw/main/files/customers/customers-500000.zip")
Using custom data configuration default-6e1837ea838b9492
Found cached dataset csv (/Users/hamel/.cache/huggingface/datasets/csv/default-6e1837ea838b9492/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Phone 1', 'Phone 2', 'Email', 'Subscription Date', 'Website'],
num_rows: 500000
})
})
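The row below was presumably produced by indexing the train split:

```python
ds["train"][0]
```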
{'Index': 1,
'Customer Id': 'e685B8690f9fbce',
'First Name': 'Erik',
'Last Name': 'Little',
'Company': 'Blankenship PLC',
'City': 'Caitlynmouth',
'Country': 'Sao Tome and Principe',
'Phone 1': '457-542-6899',
'Phone 2': '055.415.2664x5425',
'Email': 'shanehester@campbell.org',
'Subscription Date': '2021-12-23',
'Website': 'https://wagner.com/'}
map
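The original cell isn't shown; based on the output below (and the fullnm_batched function later), it was presumably a non-batched map along these lines:

```python
def fullnm(d):
    return {'Full Name': d['First Name'] + ' ' + d['Last Name']}

ds.map(fullnm)['train'][0]
```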
{'Index': 1,
'Customer Id': 'e685B8690f9fbce',
'First Name': 'Erik',
'Last Name': 'Little',
'Company': 'Blankenship PLC',
'City': 'Caitlynmouth',
'Country': 'Sao Tome and Principe',
'Phone 1': '457-542-6899',
'Phone 2': '055.415.2664x5425',
'Email': 'shanehester@campbell.org',
'Subscription Date': '2021-12-23',
'Website': 'https://wagner.com/',
'Full Name': 'Erik Little'}
`batched=True` for map

You operate over a list instead of single items; this can usually speed things up a bit. The example below is significantly faster than the default.
per the docs:
list comprehensions are usually faster than executing the same code in a for loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.
Using Dataset.map() with batched=True will be essential to unlock the speed of the “fast” tokenizers
def fullnm_batched(d): return {'Full Name': [f + ' ' + l for f,l in zip(d['First Name'], d['Last Name'])]}
ds.map(fullnm_batched, batched=True)
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Phone 1', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 500000
})
})
`batched=True` speed test

HF tokenizers can work with or without `batched=True`; let's see the difference. First we need a text field, so let's use a dataset with a larger one:
tds = load_dataset("csv",
data_files='https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip',
delimiter="\t");
Using custom data configuration default-3340c354bf896b6f
Found cached dataset csv (/Users/hamel/.cache/huggingface/datasets/csv/default-3340c354bf896b6f/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
'"I've tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia & anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I've actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me."'
Not batched:
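The timing cell isn't shown; presumably a plain, row-by-row tokenizing map roughly like this (the function name and truncation are assumptions):

```python
def tok(d):
    return tokenizer(d["review"], truncation=True)

tds.map(tok)
```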
CPU times: user 1min 21s, sys: 1.71 s, total: 1min 23s
Wall time: 1min 23s
DatasetDict({
train: Dataset({
features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 215063
})
})
With `batched=True`: 19 Seconds!
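Presumably the same map with `batched=True`, so the fast tokenizer receives whole batches of reviews at once:

```python
tds.map(tok, batched=True)
```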
CPU times: user 1min 5s, sys: 1.18 s, total: 1min 6s
Wall time: 1min 6s
DatasetDict({
train: Dataset({
features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 215063
})
})
With `num_proc` (multiprocessing): 15.7s! Per the docs:

for values of num_proc other than 8, our tests showed that it was faster to use batched=True without that option. In general, we don’t recommend using Python multiprocessing for fast tokenizers with batched=True.
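The very low parent-process CPU time in the output below suggests multiprocessing; presumably something like this (the num_proc value is an assumption):

```python
tds.map(tok, batched=True, num_proc=8)
```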
CPU times: user 911 ms, sys: 533 ms, total: 1.44 s
Wall time: 18.1 s
DatasetDict({
train: Dataset({
features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 215063
})
})
select
Good for getting a preview of different rows:
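The cell isn't shown; judging from the two rows below, it was something like this (the positions are inferred from the 'Index' values):

```python
ds["train"].select([209711, 246985])[:]
```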
{'Index': [209712, 246986],
'Customer Id': ['fad0d3B75B73cd7', 'D75eCaeAc8C6BD6'],
'First Name': ['Jo', 'Judith'],
'Last Name': ['Pittman', 'Thomas'],
'Company': ['Pineda-Hobbs', 'Mcguire, Alvarado and Kennedy'],
'City': ['Traciestad', 'Palmerfort'],
'Country': ['Finland', 'Tonga'],
'Phone 1': ['001-086-011-7063', '+1-495-667-1061x21703'],
'Phone 2': ['853-679-2287x631', '589.777.0504'],
'Email': ['gsantos@stuart.biz', 'vchung@bowman.com'],
'Subscription Date': ['2020-08-04', '2021-08-14'],
'Website': ['https://www.bautista.com/', 'https://wilkerson.org/'],
'Full Name': ['Jo Pittman', 'Judith Thomas']}
unique
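No output is captured here; unique returns the distinct values of a column, e.g. (a sketch):

```python
ds["train"].unique("Country")
```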
rename_column
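The later outputs show 'Phone 1' renamed to 'Primary Phone Number', presumably via something like:

```python
ds = ds.rename_column("Phone 1", "Primary Phone Number")
```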
filter
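No output is captured here either; filter keeps rows matching a predicate, e.g. (a sketch):

```python
ds["train"].filter(lambda x: x["Country"] == "Finland")
```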
sort
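The rows below are ordered by first name, so the cell was presumably something like:

```python
ds["train"].sort("First Name")[:3]
```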
{'Index': [491821, 170619, 212021],
'Customer Id': ['84C747dDFac8Dc7', '5886eaffEF8dc6D', 'B8a6cFab936Fb2A'],
'First Name': ['Aaron', 'Aaron', 'Aaron'],
'Last Name': ['Hull', 'Cain', 'Mays'],
'Company': ['Morrow Inc', 'Mccormick-Hardy', 'Hopkins-Larson'],
'City': ['West Charles', 'West Connie', 'Mccallchester'],
'Country': ['Netherlands', 'Vanuatu', 'Ecuador'],
'Primary Phone Number': ['670-796-3507',
'323-296-0014',
'(594)960-9651x17240'],
'Phone 2': ['001-917-832-0423x324',
'+1-551-114-3103x05351',
'996.174.5737x6442'],
'Email': ['ivan16@bender.org',
'shelley82@bender.org',
'qrhodes@stokes-larson.info'],
'Subscription Date': ['2020-05-28', '2021-04-11', '2022-03-19'],
'Website': ['http://carney-lawson.info/',
'http://www.wiggins.biz/',
'http://pugh.com/'],
'Full Name': ['Aaron Hull', 'Aaron Cain', 'Aaron Mays']}
set_format
seems to work in place:
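Presumably something like the following, which makes indexing return pandas objects:

```python
ds.set_format("pandas")
ds["train"][:5]
```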
| | Index | Customer Id | First Name | Last Name | Company | City | Country | Primary Phone Number | Phone 2 | Email | Subscription Date | Website | Full Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | e685B8690f9fbce | Erik | Little | Blankenship PLC | Caitlynmouth | Sao Tome and Principe | 457-542-6899 | 055.415.2664x5425 | shanehester@campbell.org | 2021-12-23 | https://wagner.com/ | Erik Little |
| 1 | 2 | 6EDdBA3a2DFA7De | Yvonne | Shaw | Jensen and Sons | Janetfort | Palestinian Territory | 9610730173 | 531-482-3000x7085 | kleinluis@vang.com | 2021-01-01 | https://www.paul.org/ | Yvonne Shaw |
| 2 | 3 | b9Da13bedEc47de | Jeffery | Ibarra | Rose, Deleon and Sanders | Darlenebury | Albania | (840)539-1797x479 | 209-519-5817 | deckerjamie@bartlett.biz | 2020-03-30 | https://www.morgan-phelps.com/ | Jeffery Ibarra |
| 3 | 4 | 710D4dA2FAa96B5 | James | Walters | Kline and Sons | Donhaven | Bahrain | +1-985-596-1072x3040 | (528)734-8924x054 | dochoa@carey-morse.com | 2022-01-18 | https://brennan.com/ | James Walters |
| 4 | 5 | 3c44ed62d7BfEBC | Leslie | Snyder | Price, Mason and Doyle | Mossfort | Central African Republic | 812-016-9904x8231 | 254.631.9380 | darrylbarber@warren.org | 2020-01-25 | http://www.trujillo-sullivan.info/ | Leslie Snyder |
You can get a proper pandas dataframe like this:
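Presumably by slicing the whole split while in pandas format (the variable name is an assumption):

```python
train_df = ds["train"][:]
```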
🚨 Under the hood, Dataset.set_format() changes the return format for the dataset's `__getitem__()` dunder method. This means that when we want to create a new object like train_df from a Dataset in the "pandas" format, we need to slice the whole dataset to obtain a pandas.DataFrame. You can verify for yourself that the type of drug_dataset["train"] is Dataset, irrespective of the output format.
This is going the other direction, df -> ds:
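Presumably via Dataset.from_pandas (the variable name is an assumption):

```python
from datasets import Dataset

ds_from_df = Dataset.from_pandas(train_df)
ds_from_df
```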
Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 500000
})
{'Index': [1, 2],
'Customer Id': ['e685B8690f9fbce', '6EDdBA3a2DFA7De'],
'First Name': ['Erik', 'Yvonne'],
'Last Name': ['Little', 'Shaw'],
'Company': ['Blankenship PLC', 'Jensen and Sons'],
'City': ['Caitlynmouth', 'Janetfort'],
'Country': ['Sao Tome and Principe', 'Palestinian Territory'],
'Primary Phone Number': ['457-542-6899', '9610730173'],
'Phone 2': ['055.415.2664x5425', '531-482-3000x7085'],
'Email': ['shanehester@campbell.org', 'kleinluis@vang.com'],
'Subscription Date': ['2021-12-23', '2021-01-01'],
'Website': ['https://wagner.com/', 'https://www.paul.org/'],
'Full Name': ['Erik Little', 'Yvonne Shaw']}
Note you can reset the format at any time:
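Presumably:

```python
ds.reset_format()
```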
train/test etc.
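The 400k/100k split below was presumably produced with train_test_split (the seed is an assumption):

```python
dsd = ds["train"].train_test_split(test_size=0.2, seed=42)
dsd
```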
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 400000
})
test: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 100000
})
})
You can create new partitions without calling train_test_split explicitly, by creating a new group like this:
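A sketch of one way to do this, carving a validation set out of the existing train split by assigning a new key (the split sizes are inferred from the output below):

```python
dsd["validation"] = dsd["train"].select(range(320_000, 400_000))
dsd["train"] = dsd["train"].select(range(320_000))
dsd
```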
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 320000
})
test: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 100000
})
validation: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 80000
})
})
Let’s save our ds dataset to disk:
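Presumably with save_to_disk; the directory name matches the tree output below:

```python
ds.save_to_disk("tabular_data")
```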
Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 500000
})
tabular_data
├── dataset.arrow
├── dataset_dict.json
├── dataset_info.json
├── state.json
└── train
├── dataset.arrow
├── dataset_info.json
└── state.json
1 directory, 7 files
Load the data now from disk:
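Presumably with load_from_disk (the variable name is an assumption):

```python
from datasets import load_from_disk

ds2 = load_from_disk("tabular_data")
```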
When you set streaming=True you are returned an IterableDataset object.
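For example (the dataset and config here are assumed, matching the wikitext examples elsewhere in these notes):

```python
ids = load_dataset("wikitext", "wikitext-2-v1", split="train", streaming=True)
type(ids)
```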
datasets.iterable_dataset.IterableDataset
take and skip

These are special methods for IterableDataset; they will not work on a regular dataset.
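Presumably something like:

```python
list(ids.take(4))
```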
[{'text': ''},
{'text': ' = Homarus gammarus = \n'},
{'text': ''},
{'text': ' Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . \n'}]
You can use itertools.islice to get multiple items:
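For example (a sketch):

```python
from itertools import islice

list(islice(ids, 3))
```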
The old way looks like this:
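Presumably a regular, non-streaming load, which downloads and caches the whole dataset and returns a regular Dataset:

```python
old = load_dataset("wikitext", "wikitext-2-v1", split="train")
type(old)
```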
Found cached dataset wikitext (/Users/hamel/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
datasets.arrow_dataset.Dataset
See the docs
usage: huggingface-cli <command> [<args>]
positional arguments:
{login,whoami,logout,repo,lfs-enable-largefiles,lfs-multipart-upload}
huggingface-cli command helpers
login Log in using a token from
huggingface.co/settings/tokens
whoami Find out which huggingface.co account you are logged
in as.
logout Log out
repo {create, ls-files} Commands to interact with your
huggingface.co repos.
lfs-enable-largefiles
Configure your repository to enable upload of files >
5GB.
lfs-multipart-upload
Command will get called by git-lfs, do not call it
directly.
optional arguments:
-h, --help show this help message and exit
You can use `huggingface-cli login` to log in.
HF datasets are just git repos! You can clone a repo like this:
# remote_name is the Hub repo id defined earlier, e.g. "<username>/tabular-data-test"
_dir = remote_name.split('/')[-1]
!rm -rf {_dir}
!git clone 'https://huggingface.co/datasets/'{remote_name}
!ls {_dir}
Cloning into 'tabular-data-test'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (13/13), 1.93 KiB | 164.00 KiB/s, done.
data dataset_infos.json
The parquet file is here:
The README.md file: in the Hub there is a README creation tool that has a template you can fill out.

See this lesson: HF datasets have really nice built-in tools to do semantic search. This is really useful and fun.