from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")
Found cached dataset glue (/Users/hamel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
These are some notes on the basics of working with HF datasets. These are very important if you want to fine tune LLMs because you will be downloading / uploading datasets from the Hub frequently.
Some takeaways:

- `dataset.map` does some kind of dict merge, so a `dataset.map(...)` that emits a new dict key will add an additional field, e.g. features: ['output', 'instruction', 'input'].
- You can stream a dataset instead of downloading it: `ds = load_dataset("bigcode/the-stack", streaming=True, split="train")`.
- `batched=True` is a good way to speed things up.

Following notes on this page.
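Indexing into the dataset returns a dict; the first row below was presumably produced with something like:

```python
dataset[0]
```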
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0}
You will want to tokenize the examples in this case:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
With just one input, the token_type_ids
are the same:
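The call isn't shown in these notes; judging from the decoded tokens further down, it was presumably something like:

```python
tokenizer("hello what is going on")
```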
{'input_ids': [101, 7592, 2054, 2003, 2183, 2006, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
With two inputs, the token_type_ids
are indexed accordingly:
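Again the call isn't shown; presumably (the sentences are inferred from the decoded output below):

```python
out = tokenizer("hello what is going on?", "i am here.")
out
```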
{'input_ids': [101, 7592, 2054, 2003, 2183, 2006, 1029, 102, 1045, 2572, 2182, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
groups = [[], []]
# split the token ids into segment A / segment B using token_type_ids
for i, tt in zip(out['input_ids'], out['token_type_ids']):
    groups[tt].append(i)

# decode each segment back to text
for g in groups:
    print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(g)))
[CLS] hello what is going on? [SEP]
i am here. [SEP]
The quickstart says that the model requires the field name labels
. How would we know? We can look at the forward
method of the model:
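The help output below was presumably generated with:

```python
help(model.forward)
```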
Help on method forward in module transformers.models.bert.modeling_bert:
forward(input_ids: Optional[torch.Tensor] = None, attention_mask: Optional[torch.Tensor] = None, token_type_ids: Optional[torch.Tensor] = None, position_ids: Optional[torch.Tensor] = None, head_mask: Optional[torch.Tensor] = None, inputs_embeds: Optional[torch.Tensor] = None, labels: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None) -> Union[Tuple[torch.Tensor], transformers.modeling_outputs.SequenceClassifierOutput] method of transformers.models.bert.modeling_bert.BertForSequenceClassification instance
The [`BertForSequenceClassification`] forward method, overrides the `__call__` special method.
<Tip>
Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
</Tip>
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
token_type_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Returns:
[`transformers.modeling_outputs.SequenceClassifierOutput`] or `tuple(torch.FloatTensor)`: A [`transformers.modeling_outputs.SequenceClassifierOutput`] or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([`BertConfig`]) and inputs.
- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Classification (or regression if config.num_labels==1) loss.
- **logits** (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`) -- Classification (or regression if config.num_labels==1) scores (before SoftMax).
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Example of single-label classification:
```python
>>> import torch
>>> from transformers import AutoTokenizer, BertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
>>> model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
'LABEL_1'
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity", num_labels=num_labels)
>>> labels = torch.tensor([1])
>>> loss = model(**inputs, labels=labels).loss
>>> round(loss.item(), 2)
0.01
```
Example of multi-label classification:
```python
>>> import torch
>>> from transformers import AutoTokenizer, BertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
>>> model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity", problem_type="multi_label_classification")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = BertForSequenceClassification.from_pretrained(
... "textattack/bert-base-uncased-yelp-polarity", num_labels=num_labels, problem_type="multi_label_classification"
... )
>>> labels = torch.sum(
... torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss
```
Change `label` to `labels`.
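One way to do this (a sketch, the original cell isn't shown) is with rename_column:

```python
dataset = dataset.rename_column("label", "labels")
```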
There seem to be many subsets. This is the page.
' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike\n License.\n'
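A sketch of how you could list the subsets (configurations) and load one; the config names below are from the wikitext dataset card:

```python
from datasets import get_dataset_config_names, load_dataset

get_dataset_config_names("wikitext")
# ['wikitext-103-raw-v1', 'wikitext-103-v1', 'wikitext-2-raw-v1', 'wikitext-2-v1']

wiki = load_dataset("wikitext", "wikitext-2-v1", split="train")
```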
[{'text': ''},
{'text': ' = Homarus gammarus = \n'},
{'text': ''},
{'text': ' Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . \n'},
{'text': ''}]
You can load a dataset from CSV, TSV, text, JSON, JSONL, or pandas DataFrames. You can also point to a URL:
from datasets import load_dataset
ds = load_dataset("csv", data_files="https://github.com/datablist/sample-csv-files/raw/main/files/customers/customers-500000.zip")
Using custom data configuration default-6e1837ea838b9492
Found cached dataset csv (/Users/hamel/.cache/huggingface/datasets/csv/default-6e1837ea838b9492/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Phone 1', 'Phone 2', 'Email', 'Subscription Date', 'Website'],
num_rows: 500000
})
})
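The row below was presumably produced by indexing the train split:

```python
ds["train"][0]
```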
{'Index': 1,
'Customer Id': 'e685B8690f9fbce',
'First Name': 'Erik',
'Last Name': 'Little',
'Company': 'Blankenship PLC',
'City': 'Caitlynmouth',
'Country': 'Sao Tome and Principe',
'Phone 1': '457-542-6899',
'Phone 2': '055.415.2664x5425',
'Email': 'shanehester@campbell.org',
'Subscription Date': '2021-12-23',
'Website': 'https://wagner.com/'}
map
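The original cell isn't shown; based on the output below (and the fullnm_batched function later), it was presumably a non-batched map along these lines:

```python
def fullnm(d):
    return {'Full Name': d['First Name'] + ' ' + d['Last Name']}

ds.map(fullnm)['train'][0]
```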
{'Index': 1,
'Customer Id': 'e685B8690f9fbce',
'First Name': 'Erik',
'Last Name': 'Little',
'Company': 'Blankenship PLC',
'City': 'Caitlynmouth',
'Country': 'Sao Tome and Principe',
'Phone 1': '457-542-6899',
'Phone 2': '055.415.2664x5425',
'Email': 'shanehester@campbell.org',
'Subscription Date': '2021-12-23',
'Website': 'https://wagner.com/',
'Full Name': 'Erik Little'}
`batched=True` for map

You operate over a list instead of single items; this can usually speed things up a bit. The example below is significantly faster than the default.
per the docs:
list comprehensions are usually faster than executing the same code in a for loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.
Using Dataset.map() with batched=True will be essential to unlock the speed of the “fast” tokenizers
def fullnm_batched(d): return {'Full Name': [f + ' ' + l for f,l in zip(d['First Name'], d['Last Name'])]}
ds.map(fullnm_batched, batched=True)
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Phone 1', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 500000
})
})
`batched=True` speed test

HF tokenizers can work with or without `batched=True`; let's see the difference. First we need a text field, so let's use a dataset with a larger one:
tds = load_dataset("csv",
data_files='https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip',
delimiter="\t");
Using custom data configuration default-3340c354bf896b6f
Found cached dataset csv (/Users/hamel/.cache/huggingface/datasets/csv/default-3340c354bf896b6f/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
'"I've tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia & anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I've actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me."'
Not batched:
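The timing cell isn't shown; presumably a plain, row-by-row tokenizing map roughly like this (the function name and truncation are assumptions):

```python
def tok(d):
    return tokenizer(d["review"], truncation=True)

tds.map(tok)
```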
CPU times: user 1min 21s, sys: 1.71 s, total: 1min 23s
Wall time: 1min 23s
DatasetDict({
train: Dataset({
features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 215063
})
})
With `batched=True`: 19 Seconds!
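Presumably the same map with `batched=True`, so the fast tokenizer receives whole batches of reviews at once:

```python
tds.map(tok, batched=True)
```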
CPU times: user 1min 5s, sys: 1.18 s, total: 1min 6s
Wall time: 1min 6s
DatasetDict({
train: Dataset({
features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 215063
})
})
With `num_proc` (multiprocessing): 15.7s! Per the docs:

for values of num_proc other than 8, our tests showed that it was faster to use batched=True without that option. In general, we don’t recommend using Python multiprocessing for fast tokenizers with batched=True.
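The very low parent-process CPU time in the output below suggests multiprocessing; presumably something like this (the num_proc value is an assumption):

```python
tds.map(tok, batched=True, num_proc=8)
```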
CPU times: user 911 ms, sys: 533 ms, total: 1.44 s
Wall time: 18.1 s
DatasetDict({
train: Dataset({
features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 215063
})
})
select
Good for getting a preview of different rows:
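The cell isn't shown; judging from the two rows below, it was something like this (the positions are inferred from the 'Index' values):

```python
ds["train"].select([209711, 246985])[:]
```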
{'Index': [209712, 246986],
'Customer Id': ['fad0d3B75B73cd7', 'D75eCaeAc8C6BD6'],
'First Name': ['Jo', 'Judith'],
'Last Name': ['Pittman', 'Thomas'],
'Company': ['Pineda-Hobbs', 'Mcguire, Alvarado and Kennedy'],
'City': ['Traciestad', 'Palmerfort'],
'Country': ['Finland', 'Tonga'],
'Phone 1': ['001-086-011-7063', '+1-495-667-1061x21703'],
'Phone 2': ['853-679-2287x631', '589.777.0504'],
'Email': ['gsantos@stuart.biz', 'vchung@bowman.com'],
'Subscription Date': ['2020-08-04', '2021-08-14'],
'Website': ['https://www.bautista.com/', 'https://wilkerson.org/'],
'Full Name': ['Jo Pittman', 'Judith Thomas']}
unique
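No output is captured here; unique returns the distinct values of a column, e.g. (a sketch):

```python
ds["train"].unique("Country")
```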
rename_column
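The later outputs show 'Phone 1' renamed to 'Primary Phone Number', presumably via something like:

```python
ds = ds.rename_column("Phone 1", "Primary Phone Number")
```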
filter
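No output is captured here either; filter keeps rows matching a predicate, e.g. (a sketch):

```python
ds["train"].filter(lambda x: x["Country"] == "Finland")
```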
sort
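The rows below are ordered by first name, so the cell was presumably something like:

```python
ds["train"].sort("First Name")[:3]
```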
{'Index': [491821, 170619, 212021],
'Customer Id': ['84C747dDFac8Dc7', '5886eaffEF8dc6D', 'B8a6cFab936Fb2A'],
'First Name': ['Aaron', 'Aaron', 'Aaron'],
'Last Name': ['Hull', 'Cain', 'Mays'],
'Company': ['Morrow Inc', 'Mccormick-Hardy', 'Hopkins-Larson'],
'City': ['West Charles', 'West Connie', 'Mccallchester'],
'Country': ['Netherlands', 'Vanuatu', 'Ecuador'],
'Primary Phone Number': ['670-796-3507',
'323-296-0014',
'(594)960-9651x17240'],
'Phone 2': ['001-917-832-0423x324',
'+1-551-114-3103x05351',
'996.174.5737x6442'],
'Email': ['ivan16@bender.org',
'shelley82@bender.org',
'qrhodes@stokes-larson.info'],
'Subscription Date': ['2020-05-28', '2021-04-11', '2022-03-19'],
'Website': ['http://carney-lawson.info/',
'http://www.wiggins.biz/',
'http://pugh.com/'],
'Full Name': ['Aaron Hull', 'Aaron Cain', 'Aaron Mays']}
set_format
seems to work in place:
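Presumably something like the following, which makes indexing return pandas objects:

```python
ds.set_format("pandas")
ds["train"][:5]
```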
| | Index | Customer Id | First Name | Last Name | Company | City | Country | Primary Phone Number | Phone 2 | Email | Subscription Date | Website | Full Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | e685B8690f9fbce | Erik | Little | Blankenship PLC | Caitlynmouth | Sao Tome and Principe | 457-542-6899 | 055.415.2664x5425 | shanehester@campbell.org | 2021-12-23 | https://wagner.com/ | Erik Little |
| 1 | 2 | 6EDdBA3a2DFA7De | Yvonne | Shaw | Jensen and Sons | Janetfort | Palestinian Territory | 9610730173 | 531-482-3000x7085 | kleinluis@vang.com | 2021-01-01 | https://www.paul.org/ | Yvonne Shaw |
| 2 | 3 | b9Da13bedEc47de | Jeffery | Ibarra | Rose, Deleon and Sanders | Darlenebury | Albania | (840)539-1797x479 | 209-519-5817 | deckerjamie@bartlett.biz | 2020-03-30 | https://www.morgan-phelps.com/ | Jeffery Ibarra |
| 3 | 4 | 710D4dA2FAa96B5 | James | Walters | Kline and Sons | Donhaven | Bahrain | +1-985-596-1072x3040 | (528)734-8924x054 | dochoa@carey-morse.com | 2022-01-18 | https://brennan.com/ | James Walters |
| 4 | 5 | 3c44ed62d7BfEBC | Leslie | Snyder | Price, Mason and Doyle | Mossfort | Central African Republic | 812-016-9904x8231 | 254.631.9380 | darrylbarber@warren.org | 2020-01-25 | http://www.trujillo-sullivan.info/ | Leslie Snyder |
You can get a proper pandas dataframe like this:
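Presumably by slicing the whole split while in pandas format (the variable name is an assumption):

```python
train_df = ds["train"][:]
```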
🚨 Under the hood, Dataset.set_format() changes the return format for the dataset's `__getitem__()` dunder method. This means that when we want to create a new object like train_df from a Dataset in the "pandas" format, we need to slice the whole dataset to obtain a pandas.DataFrame. You can verify for yourself that the type of drug_dataset["train"] is Dataset, irrespective of the output format.
This is going the other direction, df -> ds:
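Presumably via Dataset.from_pandas (the variable name is an assumption):

```python
from datasets import Dataset

ds_from_df = Dataset.from_pandas(train_df)
ds_from_df
```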
Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 500000
})
{'Index': [1, 2],
'Customer Id': ['e685B8690f9fbce', '6EDdBA3a2DFA7De'],
'First Name': ['Erik', 'Yvonne'],
'Last Name': ['Little', 'Shaw'],
'Company': ['Blankenship PLC', 'Jensen and Sons'],
'City': ['Caitlynmouth', 'Janetfort'],
'Country': ['Sao Tome and Principe', 'Palestinian Territory'],
'Primary Phone Number': ['457-542-6899', '9610730173'],
'Phone 2': ['055.415.2664x5425', '531-482-3000x7085'],
'Email': ['shanehester@campbell.org', 'kleinluis@vang.com'],
'Subscription Date': ['2021-12-23', '2021-01-01'],
'Website': ['https://wagner.com/', 'https://www.paul.org/'],
'Full Name': ['Erik Little', 'Yvonne Shaw']}
Note you can reset the format at any time:
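Presumably:

```python
ds.reset_format()
```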
train/test etc.
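The 400k/100k split below was presumably produced with train_test_split (the seed is an assumption):

```python
dsd = ds["train"].train_test_split(test_size=0.2, seed=42)
dsd
```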
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 400000
})
test: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 100000
})
})
You can create new partitions without calling train_test_split explicitly, by creating a new group like this:
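A sketch of one way to do this, carving a validation set out of the existing train split by assigning a new key (the split sizes are inferred from the output below):

```python
dsd["validation"] = dsd["train"].select(range(320_000, 400_000))
dsd["train"] = dsd["train"].select(range(320_000))
dsd
```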
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 320000
})
test: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 100000
})
validation: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 80000
})
})
Let’s save our ds dataset to disk:
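Presumably with save_to_disk; the directory name matches the tree output below:

```python
ds.save_to_disk("tabular_data")
```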
Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 500000
})
tabular_data
├── dataset.arrow
├── dataset_dict.json
├── dataset_info.json
├── state.json
└── train
├── dataset.arrow
├── dataset_info.json
└── state.json
1 directory, 7 files
Load the data now from disk:
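Presumably with load_from_disk (the variable name is an assumption):

```python
from datasets import load_from_disk

ds2 = load_from_disk("tabular_data")
```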
When you set streaming=True you are returned an IterableDataset object.
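For example (the dataset and config here are assumed, matching the wikitext examples elsewhere in these notes):

```python
ids = load_dataset("wikitext", "wikitext-2-v1", split="train", streaming=True)
type(ids)
```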
datasets.iterable_dataset.IterableDataset
take and skip

These are special methods for IterableDataset; they will not work on a regular dataset.
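Presumably something like:

```python
list(ids.take(4))
```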
[{'text': ''},
{'text': ' = Homarus gammarus = \n'},
{'text': ''},
{'text': ' Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . \n'}]
You can use itertools.islice to get multiple items:
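For example (a sketch):

```python
from itertools import islice

list(islice(ids, 3))
```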
The old way looks like this:
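Presumably a regular, non-streaming load, which downloads and caches the whole dataset and returns a regular Dataset:

```python
old = load_dataset("wikitext", "wikitext-2-v1", split="train")
type(old)
```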
Found cached dataset wikitext (/Users/hamel/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
datasets.arrow_dataset.Dataset
See the docs
usage: huggingface-cli <command> [<args>]
positional arguments:
{login,whoami,logout,repo,lfs-enable-largefiles,lfs-multipart-upload}
huggingface-cli command helpers
login Log in using a token from
huggingface.co/settings/tokens
whoami Find out which huggingface.co account you are logged
in as.
logout Log out
repo {create, ls-files} Commands to interact with your
huggingface.co repos.
lfs-enable-largefiles
Configure your repository to enable upload of files >
5GB.
lfs-multipart-upload
Command will get called by git-lfs, do not call it
directly.
optional arguments:
-h, --help show this help message and exit
You can use `huggingface-cli login` to log in.
HF datasets are just git repos! You can clone a repo like this:
# remote_name is the Hub repo id defined earlier, e.g. "<username>/tabular-data-test"
_dir = remote_name.split('/')[-1]
!rm -rf {_dir}
!git clone 'https://huggingface.co/datasets/'{remote_name}
!ls {_dir}
Cloning into 'tabular-data-test'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (13/13), 1.93 KiB | 164.00 KiB/s, done.
data dataset_infos.json
The parquet file is here:
The README.md file: in the Hub there is a README creation tool that has a template you can fill out.

See this lesson: HF datasets have really nice built-in tools to do semantic search. This is really useful and fun.