from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")
Found cached dataset glue (/Users/hamel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
These are some notes on the basics of working with HF datasets. They are very important if you want to fine-tune LLMs, because you will be downloading and uploading datasets from the Hub frequently.
dataset.map
`dataset.map(...)` does a dict merge, so a mapping function that emits a new dict key will add an additional field:
features: ['output', 'instruction', 'input']
ds = load_dataset("bigcode/the-stack", streaming=True, split="train")
`batched=True` is a good way to speed things up.
The following notes are based on this page.
Found cached dataset glue (/Users/hamel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0}
You will want to tokenize examples in this case.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
With just one input, the `token_type_ids` are all the same:
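The output below comes from a call along these lines (the exact input string is an assumption, reconstructed from the decoded tokens shown further down):

```python
tokenizer("hello what is going on")
```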
{'input_ids': [101, 7592, 2054, 2003, 2183, 2006, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
With two inputs, the `token_type_ids` are indexed accordingly:
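Again a reconstruction; the two input strings are assumptions based on the decoded output below, and `out` is reused by the snippet that follows:

```python
out = tokenizer("hello what is going on?", "i am here.")
out
```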
{'input_ids': [101, 7592, 2054, 2003, 2183, 2006, 1029, 102, 1045, 2572, 2182, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
groups = [[], []]
for i, tt in zip(out['input_ids'], out['token_type_ids']):
    groups[tt].append(i)
for g in groups:
    print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(g)))
[CLS] hello what is going on? [SEP]
i am here. [SEP]
The quickstart says that the model requires the field name `labels`. How would we know? We can look at the `forward` method of the model:
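The help text below was produced by something like:

```python
help(model.forward)
```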
Help on method forward in module transformers.models.bert.modeling_bert:
forward(input_ids: Optional[torch.Tensor] = None, attention_mask: Optional[torch.Tensor] = None, token_type_ids: Optional[torch.Tensor] = None, position_ids: Optional[torch.Tensor] = None, head_mask: Optional[torch.Tensor] = None, inputs_embeds: Optional[torch.Tensor] = None, labels: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None) -> Union[Tuple[torch.Tensor], transformers.modeling_outputs.SequenceClassifierOutput] method of transformers.models.bert.modeling_bert.BertForSequenceClassification instance
The [`BertForSequenceClassification`] forward method, overrides the `__call__` special method.
<Tip>
Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
</Tip>
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
token_type_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Returns:
[`transformers.modeling_outputs.SequenceClassifierOutput`] or `tuple(torch.FloatTensor)`: A [`transformers.modeling_outputs.SequenceClassifierOutput`] or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([`BertConfig`]) and inputs.
- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Classification (or regression if config.num_labels==1) loss.
- **logits** (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`) -- Classification (or regression if config.num_labels==1) scores (before SoftMax).
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Example of single-label classification:
```python
>>> import torch
>>> from transformers import AutoTokenizer, BertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
>>> model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
'LABEL_1'
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity", num_labels=num_labels)
>>> labels = torch.tensor([1])
>>> loss = model(**inputs, labels=labels).loss
>>> round(loss.item(), 2)
0.01
```
Example of multi-label classification:
```python
>>> import torch
>>> from transformers import AutoTokenizer, BertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-yelp-polarity")
>>> model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-yelp-polarity", problem_type="multi_label_classification")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = BertForSequenceClassification.from_pretrained(
... "textattack/bert-base-uncased-yelp-polarity", num_labels=num_labels, problem_type="multi_label_classification"
... )
>>> labels = torch.sum(
... torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss
```
Change `label` to `labels`:
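A minimal sketch of the rename, assuming we apply it to the MRPC `dataset` loaded above:

```python
dataset = dataset.rename_column("label", "labels")
```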
There seem to be many subsets. This is the page.
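A sketch of loading one subset and peeking at the description and first few records; the config name ("wikitext-2-v1") comes from a cache path shown later in these notes, and the split choice is an assumption:

```python
from datasets import load_dataset

wiki = load_dataset("wikitext", "wikitext-2-v1", split="train")
print(wiki.info.description)   # the description string shown below
[wiki[i] for i in range(5)]    # first few records
```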
' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike\n License.\n'
[{'text': ''},
{'text': ' = Homarus gammarus = \n'},
{'text': ''},
{'text': ' Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . \n'},
{'text': ''}]
You can load a dataset from csv, tsv, text, json, jsonl, or pandas dataframes. You can also point to a URL:
from datasets import load_dataset
ds = load_dataset("csv", data_files="https://github.com/datablist/sample-csv-files/raw/main/files/customers/customers-500000.zip")
Using custom data configuration default-6e1837ea838b9492
Found cached dataset csv (/Users/hamel/.cache/huggingface/datasets/csv/default-6e1837ea838b9492/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Phone 1', 'Phone 2', 'Email', 'Subscription Date', 'Website'],
num_rows: 500000
})
})
{'Index': 1,
'Customer Id': 'e685B8690f9fbce',
'First Name': 'Erik',
'Last Name': 'Little',
'Company': 'Blankenship PLC',
'City': 'Caitlynmouth',
'Country': 'Sao Tome and Principe',
'Phone 1': '457-542-6899',
'Phone 2': '055.415.2664x5425',
'Email': 'shanehester@campbell.org',
'Subscription Date': '2021-12-23',
'Website': 'https://wagner.com/'}
map
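The `Full Name` field in the output below would come from a mapping function along these lines (the function name is an assumption; compare with the batched version further down):

```python
def fullnm(row):
    return {'Full Name': row['First Name'] + ' ' + row['Last Name']}

ds = ds.map(fullnm)
ds['train'][0]
```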
{'Index': 1,
'Customer Id': 'e685B8690f9fbce',
'First Name': 'Erik',
'Last Name': 'Little',
'Company': 'Blankenship PLC',
'City': 'Caitlynmouth',
'Country': 'Sao Tome and Principe',
'Phone 1': '457-542-6899',
'Phone 2': '055.415.2664x5425',
'Email': 'shanehester@campbell.org',
'Subscription Date': '2021-12-23',
'Website': 'https://wagner.com/',
'Full Name': 'Erik Little'}
`batched=True` for `map`
You operate over a list instead of single items, which can usually speed things up a bit. The example below is significantly faster than the default.
per the docs:
list comprehensions are usually faster than executing the same code in a for loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.
Using Dataset.map() with batched=True will be essential to unlock the speed of the “fast” tokenizers
def fullnm_batched(d):
    return {'Full Name': [f + ' ' + l for f, l in zip(d['First Name'], d['Last Name'])]}

ds.map(fullnm_batched, batched=True)
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Phone 1', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 500000
})
})
`batched=True` speed test
HF tokenizers can work with or without `batched=True`; let's see the difference. First we need a text field, so let's use a dataset with a larger text field:
tds = load_dataset("csv",
data_files='https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip',
delimiter="\t");
Using custom data configuration default-3340c354bf896b6f
Found cached dataset csv (/Users/hamel/.cache/huggingface/datasets/csv/default-3340c354bf896b6f/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
'"I've tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia & anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I've actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me."'
Not batched:
CPU times: user 1min 21s, sys: 1.71 s, total: 1min 23s
Wall time: 1min 23s
DatasetDict({
train: Dataset({
features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 215063
})
})
Batched:
19 Seconds!
CPU times: user 1min 5s, sys: 1.18 s, total: 1min 6s
Wall time: 1min 6s
DatasetDict({
train: Dataset({
features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 215063
})
})
15.7s!
for values of num_proc other than 8, our tests showed that it was faster to use batched=True without that option. In general, we don’t recommend using Python multiprocessing for fast tokenizers with batched=True.
CPU times: user 911 ms, sys: 533 ms, total: 1.44 s
Wall time: 18.1 s
DatasetDict({
train: Dataset({
features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 215063
})
})
`select` is good for getting a preview of different rows:
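A sketch; the indices are hypothetical, chosen so that they line up with the `Index` values in the output below (row *i* has `Index` *i + 1*):

```python
ds['train'].select([209711, 246985])[:]
```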
{'Index': [209712, 246986],
'Customer Id': ['fad0d3B75B73cd7', 'D75eCaeAc8C6BD6'],
'First Name': ['Jo', 'Judith'],
'Last Name': ['Pittman', 'Thomas'],
'Company': ['Pineda-Hobbs', 'Mcguire, Alvarado and Kennedy'],
'City': ['Traciestad', 'Palmerfort'],
'Country': ['Finland', 'Tonga'],
'Phone 1': ['001-086-011-7063', '+1-495-667-1061x21703'],
'Phone 2': ['853-679-2287x631', '589.777.0504'],
'Email': ['gsantos@stuart.biz', 'vchung@bowman.com'],
'Subscription Date': ['2020-08-04', '2021-08-14'],
'Website': ['https://www.bautista.com/', 'https://wilkerson.org/'],
'Full Name': ['Jo Pittman', 'Judith Thomas']}
unique
rename_column
filter
sort
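A combined sketch of these four operations. The column choices are assumptions, except for the rename, which matches the `Primary Phone Number` column that shows up in the outputs below:

```python
ds['train'].unique('Country')                             # distinct values of a column
ds = ds.rename_column('Phone 1', 'Primary Phone Number')  # rename a column everywhere
ds['train'].filter(lambda r: r['Country'] == 'Finland')   # keep only matching rows
ds['train'].sort('First Name')[:3]                        # sorted preview, as shown below
```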
{'Index': [491821, 170619, 212021],
'Customer Id': ['84C747dDFac8Dc7', '5886eaffEF8dc6D', 'B8a6cFab936Fb2A'],
'First Name': ['Aaron', 'Aaron', 'Aaron'],
'Last Name': ['Hull', 'Cain', 'Mays'],
'Company': ['Morrow Inc', 'Mccormick-Hardy', 'Hopkins-Larson'],
'City': ['West Charles', 'West Connie', 'Mccallchester'],
'Country': ['Netherlands', 'Vanuatu', 'Ecuador'],
'Primary Phone Number': ['670-796-3507',
'323-296-0014',
'(594)960-9651x17240'],
'Phone 2': ['001-917-832-0423x324',
'+1-551-114-3103x05351',
'996.174.5737x6442'],
'Email': ['ivan16@bender.org',
'shelley82@bender.org',
'qrhodes@stokes-larson.info'],
'Subscription Date': ['2020-05-28', '2021-04-11', '2022-03-19'],
'Website': ['http://carney-lawson.info/',
'http://www.wiggins.biz/',
'http://pugh.com/'],
'Full Name': ['Aaron Hull', 'Aaron Cain', 'Aaron Mays']}
`set_format` seems to work in place:
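A minimal sketch of what produces the table below:

```python
ds.set_format('pandas')   # changes the output format in place
ds['train'][:5]           # now renders as a pandas DataFrame
```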
|   | Index | Customer Id | First Name | Last Name | Company | City | Country | Primary Phone Number | Phone 2 | Email | Subscription Date | Website | Full Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | e685B8690f9fbce | Erik | Little | Blankenship PLC | Caitlynmouth | Sao Tome and Principe | 457-542-6899 | 055.415.2664x5425 | shanehester@campbell.org | 2021-12-23 | https://wagner.com/ | Erik Little |
| 1 | 2 | 6EDdBA3a2DFA7De | Yvonne | Shaw | Jensen and Sons | Janetfort | Palestinian Territory | 9610730173 | 531-482-3000x7085 | kleinluis@vang.com | 2021-01-01 | https://www.paul.org/ | Yvonne Shaw |
| 2 | 3 | b9Da13bedEc47de | Jeffery | Ibarra | Rose, Deleon and Sanders | Darlenebury | Albania | (840)539-1797x479 | 209-519-5817 | deckerjamie@bartlett.biz | 2020-03-30 | https://www.morgan-phelps.com/ | Jeffery Ibarra |
| 3 | 4 | 710D4dA2FAa96B5 | James | Walters | Kline and Sons | Donhaven | Bahrain | +1-985-596-1072x3040 | (528)734-8924x054 | dochoa@carey-morse.com | 2022-01-18 | https://brennan.com/ | James Walters |
| 4 | 5 | 3c44ed62d7BfEBC | Leslie | Snyder | Price, Mason and Doyle | Mossfort | Central African Republic | 812-016-9904x8231 | 254.631.9380 | darrylbarber@warren.org | 2020-01-25 | http://www.trujillo-sullivan.info/ | Leslie Snyder |
You can get a proper pandas dataframe like this:
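A sketch, assuming the "pandas" format set above is still active:

```python
train_df = ds['train'][:]   # slicing the whole split returns a pandas.DataFrame
```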
🚨 Under the hood, `Dataset.set_format()` changes the return format for the dataset's `__getitem__()` dunder method. This means that when we want to create a new object like `train_df` from a `Dataset` in the "pandas" format, we need to slice the whole dataset to obtain a `pandas.DataFrame`. You can verify for yourself that the type of `drug_dataset["train"]` is `Dataset`, irrespective of the output format.
This goes the other direction, DataFrame -> Dataset:
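A minimal sketch using `Dataset.from_pandas`:

```python
from datasets import Dataset

Dataset.from_pandas(train_df)
```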
Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 500000
})
{'Index': [1, 2],
'Customer Id': ['e685B8690f9fbce', '6EDdBA3a2DFA7De'],
'First Name': ['Erik', 'Yvonne'],
'Last Name': ['Little', 'Shaw'],
'Company': ['Blankenship PLC', 'Jensen and Sons'],
'City': ['Caitlynmouth', 'Janetfort'],
'Country': ['Sao Tome and Principe', 'Palestinian Territory'],
'Primary Phone Number': ['457-542-6899', '9610730173'],
'Phone 2': ['055.415.2664x5425', '531-482-3000x7085'],
'Email': ['shanehester@campbell.org', 'kleinluis@vang.com'],
'Subscription Date': ['2021-12-23', '2021-01-01'],
'Website': ['https://wagner.com/', 'https://www.paul.org/'],
'Full Name': ['Erik Little', 'Yvonne Shaw']}
Note you can reset the format at any time:
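```python
ds.reset_format()   # go back to returning plain Python dicts
```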
train/test etc.
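A sketch of the split that produces the output below (the variable name `dd` is an assumption):

```python
dd = ds['train'].train_test_split(test_size=0.2)   # 500,000 rows -> 400,000 train / 100,000 test
dd
```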
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 400000
})
test: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 100000
})
})
You can create new partitions without calling `train_test_split` explicitly, by creating a new group like this:
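A sketch: carve a validation set out of the train split by assigning a brand-new key (the exact row ranges are assumptions, chosen to match the sizes shown below):

```python
dd['validation'] = dd['train'].select(range(320_000, 400_000))  # 80,000 rows
dd['train'] = dd['train'].select(range(320_000))                # 320,000 rows remain
dd
```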
DatasetDict({
train: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 320000
})
test: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 100000
})
validation: Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 80000
})
})
Let’s save our ds dataset to disk:
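A minimal sketch (which object gets saved here is an assumption); the folder name matches the tree shown below, which itself would come from something like `!tree tabular_data`:

```python
ds.save_to_disk("tabular_data")
```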
Dataset({
features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
num_rows: 500000
})
tabular_data
├── dataset.arrow
├── dataset_dict.json
├── dataset_info.json
├── state.json
└── train
├── dataset.arrow
├── dataset_info.json
└── state.json
1 directory, 7 files
Load the data now from disk:
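A sketch using `load_from_disk`, assuming the `tabular_data` folder written above:

```python
from datasets import load_from_disk

dataset = load_from_disk("tabular_data")
dataset
```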
When you set `streaming=True` you are returned an `IterableDataset` object:
datasets.iterable_dataset.IterableDataset
`take` and `skip` are special methods for `IterableDataset`; they will not work on a regular dataset.
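A sketch of using them. The underlying streaming dataset is an assumption here; the records returned below match the wikitext data shown earlier:

```python
stream = load_dataset("wikitext", "wikitext-2-v1", split="train", streaming=True)
list(stream.take(4))   # materialize the first 4 records (output below)
stream.skip(1000)      # a new IterableDataset that skips the first 1,000 records
```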
[{'text': ''},
{'text': ' = Homarus gammarus = \n'},
{'text': ''},
{'text': ' Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . \n'}]
You can use `itertools.islice` to get multiple items:
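A sketch, reusing the hypothetical `stream` object from above:

```python
from itertools import islice

list(islice(stream, 3))   # pull the first 3 records off the stream
```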
The old (non-streaming) way looks like this:
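A sketch of the non-streaming load whose cached path and type are shown below (the variable name is an assumption):

```python
regular = load_dataset("wikitext", "wikitext-2-v1", split="train")   # no streaming
type(regular)
```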
Found cached dataset wikitext (/Users/hamel/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
datasets.arrow_dataset.Dataset
See the docs
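The help text below is what you get from something like:

```python
!huggingface-cli --help
```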
usage: huggingface-cli <command> [<args>]
positional arguments:
{login,whoami,logout,repo,lfs-enable-largefiles,lfs-multipart-upload}
huggingface-cli command helpers
login Log in using a token from
huggingface.co/settings/tokens
whoami Find out which huggingface.co account you are logged
in as.
logout Log out
repo {create, ls-files} Commands to interact with your
huggingface.co repos.
lfs-enable-largefiles
Configure your repository to enable upload of files >
5GB.
lfs-multipart-upload
Command will get called by git-lfs, do not call it
directly.
optional arguments:
-h, --help show this help message and exit
You can use `huggingface-cli login` to log in.
HF datasets are just git repos! You can clone a repo like this:
# remote_name is defined in an earlier cell, e.g. "<username>/tabular-data-test"
_dir = remote_name.split('/')[-1]
!rm -rf {_dir}
!git clone 'https://huggingface.co/datasets/'{remote_name}
!ls {_dir}
Cloning into 'tabular-data-test'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (13/13), 1.93 KiB | 164.00 KiB/s, done.
data dataset_infos.json
The parquet file is here:
`README.md` file: in the Hub there is a README creation tool that has a template you can fill out.
See this lesson: HF datasets have really nice built-in tools to do semantic search. This is really useful and fun.