Data

fastai
Notes on the DataBlock API.
from fastbook import *

Data In fastai

One of the most important things to understand in fastai is how to prepare your data for a model. The main workhorse for this is the DataBlock API. Here is a hello-world example of how it works:

Hello World DataBlock

The arguments get_x and get_y are functions that are applied to each item of an iterable. Let’s define an iterable as our data:

data = list(range(100))
def get_x(r): return r
def get_y(r): return r + 10
dblock = DataBlock(get_x=get_x, get_y = get_y)
dsets = dblock.datasets(data)

You can see a dataset like so:

dsets.train[0]
(89, 99)
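To make the (x, y) pair above less magical, here is a plain-Python sketch of roughly what `dblock.datasets(data)` produces: each item is passed through `get_x` and `get_y`, and the items are split at random into a training and a validation set. The helper name `make_datasets`, the 20% validation fraction, and the seed are assumptions made for illustration; this is not fastai's implementation.

```python
import random

def get_x(r): return r
def get_y(r): return r + 10

def make_datasets(items, get_x, get_y, valid_pct=0.2, seed=42):
    """Hypothetical stand-in for DataBlock.datasets: returns (train, valid) lists of (x, y)."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)          # random split, so train[0] is not item 0
    cut = int(valid_pct * len(items))
    valid_idxs, train_idxs = idxs[:cut], idxs[cut:]
    pair = lambda i: (get_x(items[i]), get_y(items[i]))
    return [pair(i) for i in train_idxs], [pair(i) for i in valid_idxs]

data = list(range(100))
train, valid = make_datasets(data, get_x, get_y)
x, y = train[0]
print(len(train), len(valid))  # 80 20
print(y - x)                   # 10
```

Because the split is random, the first training sample is generally not the first item of `data`, which is why `dsets.train[0]` above shows (89, 99) rather than (0, 10).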

You can also see a DataLoader like so:

dls = dblock.dataloaders(data, bs=5)
next(iter(dls.train))
(tensor([57, 66, 73, 30, 14]), tensor([67, 76, 83, 40, 24]))
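The batch above is just several (x, y) samples regrouped into one collection of x values and one collection of y values. A minimal sketch of that collation step in plain Python, using lists where fastai produces tensors; `collate` is a hypothetical helper, not a fastai function:

```python
def collate(samples):
    """Turn [(x1, y1), (x2, y2), ...] into ([x1, x2, ...], [y1, y2, ...])."""
    xs, ys = zip(*samples)
    return list(xs), list(ys)

# The five samples that make up the batch shown above
samples = [(57, 67), (66, 76), (73, 83), (30, 40), (14, 24)]
xb, yb = collate(samples)
print(xb)  # [57, 66, 73, 30, 14]
print(yb)  # [67, 76, 83, 40, 24]
```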

With A DataFrame

Similarly, when the data is a DataFrame, get_x and get_y receive one row at a time:

import pandas as pd
df = pd.DataFrame({'x': range(100), 'y': range(100) })
df.head()
   x  y
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
def get_x(r): return r.x
def get_y(r): return r.y + 10
dblock = DataBlock(get_x=get_x, get_y=get_y)
dsets = dblock.datasets(df)
dsets.train[0]
(78, 88)
dls = dblock.dataloaders(df, bs=3)
next(iter(dls.train))
(tensor([90, 55, 11]), tensor([100,  65,  21]))
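def tracer(nm):
    # Returns a function that converts its input to a string; nm labels the stage.
    # Uncomment the prints or the ipdb line to inspect what each stage receives.
    def f(x, nm):
        # print(f'{nm}:')
        # print(f'\tinput: {x}')
        # import ipdb; ipdb.set_trace()
        return str(x)
    return partial(f, nm=nm)
def mult_0(x): return x * 0
def add_1(x): return x + 1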
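The `tracer` helper uses `partial` (which the notes presumably get through the `from fastbook import *` line) to bake the label `nm` into the returned function. A self-contained sketch with the debugging print enabled, using only the standard library; the variable name `trace_item` is made up for this example:

```python
from functools import partial

def tracer(nm):
    def f(x, nm):
        print(f'{nm}: input={x!r}')  # enabled here; commented out in the notes
        return str(x)
    return partial(f, nm=nm)         # nm is fixed, so the result takes only x

trace_item = tracer('item_tfms')
out = trace_item(3)   # prints: item_tfms: input=3
print(repr(out))      # '3'
```

Labeling each tracer this way makes it possible to tell which stage of the data pipeline a value is passing through when several tracers are active at once.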
tb = TransformBlock(item_tfms=[tracer('item_tfms')])
# def get_y(l): return sum(l)
db = DataBlock(blocks=(TransformBlock, TransformBlock),
               get_x=mult_0,
               get_y=add_1,
               item_tfms=lambda x: str(x))
data = L(range(10))
result = db.datasets(data)
db.summary(data)
Setting-up type transforms pipelines
Collecting items from [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Found 10 items
2 datasets of sizes 8,2
Setting up Pipeline: mult_0
Setting up Pipeline: add_1

Building one sample
  Pipeline: mult_0
    starting from
      1
    applying mult_0 gives
      0
  Pipeline: add_1
    starting from
      1
    applying add_1 gives
      2

Final sample: (0, 2)


Collecting items from [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Found 10 items
2 datasets of sizes 8,2
Setting up Pipeline: mult_0
Setting up Pipeline: add_1
Setting up after_item: Pipeline: <lambda> -> ToTensor
Setting up before_batch: Pipeline: 
Setting up after_batch: Pipeline: 

Building one batch
Applying item_tfms to the first sample:
  Pipeline: <lambda> -> ToTensor
    starting from
      (0, 2)
    applying <lambda> gives
      (0, 2)
    applying ToTensor gives
      (0, 2)

Adding the next 3 samples

No before_batch transform to apply

Collating items in a batch

No batch_tfms to apply
result.train[0]
(0, 5)
result = db.dataloaders(data, bs=3)
thing = iter(result.train)
next(thing)
(('0', '0', '0'), ('6', '7', '4'))
next(thing)
(('0', '0', '0'), ('9', '5', '3'))
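The two batches above can be read as the output of a small pipeline: `get_x` and `get_y` build each sample, the `item_tfms` function is applied to every element of the sample, and the samples are then collated into a batch. The sketch below mimics that order in plain Python. `build_batch` is a hypothetical helper, and the real fastai pipeline does more (type dispatch, `ToTensor`, shuffling), so treat this as an approximation only.

```python
def mult_0(x): return x * 0
def add_1(x): return x + 1

def build_batch(items, get_x, get_y, item_tfm):
    """Hypothetical helper: build samples, transform each element, then collate."""
    samples = [(get_x(i), get_y(i)) for i in items]              # e.g. (0, 6)
    samples = [tuple(item_tfm(o) for o in s) for s in samples]   # e.g. ('0', '6')
    return tuple(zip(*samples))                                  # group xs and ys

batch = build_batch([5, 6, 3], mult_0, add_1, str)
print(batch)  # (('0', '0', '0'), ('6', '7', '4'))
```

Since the item transform turns every element into a string, there is nothing numeric left to stack into a tensor, which is consistent with the batches above being tuples of strings.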
??TransformBlock
db = DataBlock(blocks=(TransformBlock, tb),
               get_y=lambda x: str(x),
               batch_tfms=tracer('batch_tfms'))
result = db.datasets(data)
result = db.dataloaders(data, bs=3)
result
<fastai.data.core.DataLoaders>
thing = iter(result.train)
next(thing)
(('1', '5', '6'), ('1', '5', '6'))
f = aug_transforms()[0]
f
Flip -- {'size': None, 'mode': 'bilinear', 'pad_mode': 'reflection', 'mode_mask': 'nearest', 'align_corners': True, 'p': 0.5}:
encodes: (TensorImage,object) -> encodes
(TensorMask,object) -> encodes
(TensorBBox,object) -> encodes
(TensorPoint,object) -> encodes
decodes: