Data containers

This tutorial explains what data containers are, how they are used in FastAI.jl and how to create your own. You are encouraged to follow along in a REPL or a Jupyter notebook and explore the code. You will find small exercises at the end of some sections to deepen your understanding.

Introduction

In the quickstart section, you already encountered data containers. The following code was used to load a data container for image classification:

using FastAI
using FastAI.Datasets
using FastAI.Datasets: datasetpath, loadtaskdata

NAME = "imagenette2-160"
dir = datasetpath(NAME)
data = loadtaskdata(dir, ImageClassificationTask)

mapobs((input = FastAI.Datasets.loadfile, target = FastAI.Datasets.var"#27#32"()), DataSubset(::FastAI.Datasets.FileDataset, ::Vector{Int64}, ObsDim.Undefined())
 13394 observations)

A data container is any type that holds observations of data and allows us to load them with getobs and query the number of observations with nobs:

obs = getobs(data, 1)

(input = ColorTypes.RGB{FixedPointNumbers.N0f8}[RGB{N0f8}(0.451,0.424,0.314) RGB{N0f8}(0.6,0.569,0.478) … RGB{N0f8}(0.773,0.824,0.784) RGB{N0f8}(0.753,0.804,0.765); RGB{N0f8}(0.584,0.557,0.447) RGB{N0f8}(0.745,0.714,0.624) … RGB{N0f8}(0.71,0.761,0.722) RGB{N0f8}(0.678,0.729,0.69); … ; RGB{N0f8}(0.525,0.627,0.624) RGB{N0f8}(0.498,0.612,0.596) … RGB{N0f8}(0.506,0.553,0.459) RGB{N0f8}(0.392,0.439,0.345); RGB{N0f8}(0.396,0.498,0.486) RGB{N0f8}(0.431,0.533,0.522) … RGB{N0f8}(0.361,0.412,0.294) RGB{N0f8}(0.353,0.408,0.278)], target = "n01440764")

nobs(data)

In this case, each observation is a named tuple of an image (input) and the corresponding class (target); after all, we want to use it for image classification.

image, class = obs
@show class
image

class = "n01440764"
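To make the interface concrete, here is a minimal sketch of a hand-written data container. It is an illustration only: the module ToyContainer, the type PairData, and the local getobs and nobs functions are hypothetical stand-ins defined for this example. With FastAI.jl loaded you would add methods to the library's own getobs and nobs instead of declaring new functions.

```julia
module ToyContainer

# Stand-ins for the interface functions (assumption: declared locally so the
# sketch runs without any packages).
function getobs end
function nobs end

# A toy container holding inputs and targets in two parallel vectors.
struct PairData
    features::Vector{Float64}
    labels::Vector{String}
end

# Load a single observation by index, as a named tuple like the one above.
getobs(data::PairData, idx::Int) =
    (input = data.features[idx], target = data.labels[idx])

# Query the number of observations.
nobs(data::PairData) = length(data.features)

end # module

toy = ToyContainer.PairData([0.1, 0.5, 0.9], ["low", "mid", "high"])
obs = ToyContainer.getobs(toy, 2)
n = ToyContainer.nobs(toy)
```

Everything a consumer needs to know about the container is covered by these two functions, which is why very different data sources can be treated the same way.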

As you saw above, the Datasets submodule provides functions for loading and creating data containers. We used Datasets.datasetpath to download the dataset if it wasn’t already present and to get the folder it was downloaded to. Then, Datasets.loadtaskdata took that folder and loaded a data container suitable for image classification. FastAI.jl makes it easy to download the datasets from fastai’s collection on AWS Open Datasets. For the full list, see Datasets.DATASETS.

Exercises

Have a look at the other image classification datasets in Datasets.DATASETS_IMAGECLASSIFICATION and change the above code to load a different dataset.

Creating data containers from files

loadtaskdata makes it easy to get started when your dataset already comes in the correct format, but alas, datasets come in all different shapes and sizes. To get a look behind the scenes, let’s create the same data container again, this time using the more general functions that FastAI.jl provides. If each observation in your dataset is a file in a folder, FileDataset conveniently creates a data container given a path. We’ll use the path of the downloaded dataset:

using FastAI.Datasets: FileDataset

filedata = FileDataset(dir)

FileDataset("/home/runner/.julia/datadeps/fastai-imagenette2-160/imagenette2-160", 13397 observations)

filedata is a data container where each observation is a path to a file. We’ll confirm that using getobs:

p = getobs(filedata, 100)

p"/home/runner/.julia/datadeps/fastai-imagenette2-160/imagenette2-160/train/n01440764/n01440764_10847.JPEG"

Next, we need to load an image and the corresponding class from the path. If you have a look at the folder structure of dir, you can see that the parent folder of each file gives the name of the class. So we can use the following function to load the (image, class) pair from a path:

using FastAI.Datasets: loadfile, filename

# Load the image at path `p` and take the class from its parent folder's name.
function loadimageclass(p)
    return (
        loadfile(p),
        filename(parent(p)),
    )
end

image, class = loadimageclass(p)
@show class
image

class = "n01440764"
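The "parent folder is the class" rule can also be tried out on plain strings, without downloading anything. The sketch below uses only Base path functions; classfrompath and the example path are made up for illustration and are not part of FastAI.jl.

```julia
# Hypothetical helper: the class label is the name of the file's parent folder.
classfrompath(path::AbstractString) = basename(dirname(path))

# An example path following the train/<class>/<file> layout described above.
path = "/data/train/n01440764/n01440764_10847.JPEG"
label = classfrompath(path)
```

Here label is the folder name n01440764, the same kind of string that loadimageclass returns as the class.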

Finally, we use mapobs to lazily transform each observation and have a data container ready to be used for training an image classifier.

data = mapobs(loadimageclass, filedata);

mapobs(loadimageclass, FileDataset("/home/runner/.julia/datadeps/fastai-imagenette2-160/imagenette2-160", 13397 observations))
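What "lazily" means here can be sketched in a few lines: the wrapper only stores the function and the source container, and applies the function when an observation is requested. This is an illustration of the idea, not the implementation of mapobs. LazyMap and slowload are hypothetical names, and indexing and length stand in for getobs and nobs.

```julia
# Hypothetical lazy wrapper: keeps the function and the source data around
# and applies the function only when an observation is accessed.
struct LazyMap{F,D}
    f::F
    data::D
end

Base.getindex(m::LazyMap, idx::Int) = m.f(m.data[idx])
Base.length(m::LazyMap) = length(m.data)

calls = Ref(0)  # counts how often the transformation actually runs

# Stand-in for an expensive loading function such as reading an image.
slowload(path) = (calls[] += 1; uppercase(path))

paths = ["a.jpg", "b.jpg", "c.jpg"]
lazy = LazyMap(slowload, paths)

calls_before = calls[]   # creating the wrapper has not loaded anything
second = lazy[2]         # runs `slowload` for this one observation only
calls_after = calls[]
```

Because nothing is loaded up front, wrapping a container with thousands of image files is cheap; the cost is paid per observation during training.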

Exercises

Using mapobs and loadfile, create a data container where every observation is only an image.

Splitting a data container into subsets

Until now, we’ve only created a single data container containing all observations in a dataset. In practice, though, you’ll want to have at least a training and a validation split. The easiest way to get these is to randomly split your data container into two parts. Here we split data into 80% training and 20% validation data. Note the use of shuffleobs: the files are listed folder by folder, so shuffling before splitting makes sure each split has approximately the same class distribution.

traindata, valdata = splitobs(shuffleobs(data), at = 0.8);

(DataSubset(::FastAI.Datasets.MappedData, view(::Vector{Int64}, 1:10718), ObsDim.Undefined())
 10718 observations, DataSubset(::FastAI.Datasets.MappedData, view(::Vector{Int64}, 10719:13397), ObsDim.Undefined())
 2679 observations)
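The shuffle-then-split idea can be sketched on bare indices. This is not how shuffleobs and splitobs are implemented; shuffled_split is a hypothetical helper, and rounding at * n to the nearest integer is an assumption that happens to reproduce the subset sizes shown above.

```julia
using Random

# Hypothetical helper: shuffle the indices 1:n, then cut them at fraction `at`.
function shuffled_split(n::Int; at::Float64 = 0.8, rng = Random.default_rng())
    indices = shuffle(rng, 1:n)      # random permutation of 1:n
    ntrain = round(Int, at * n)      # size of the first part (assumed rounding)
    return indices[1:ntrain], indices[ntrain+1:end]
end

# Same number of observations as `data` above; the seed is arbitrary.
trainidx, validx = shuffled_split(13397; at = 0.8, rng = MersenneTwister(42))
ntrain, nvalid = length(trainidx), length(validx)
```

The two index vectors are disjoint and together cover every observation exactly once, so each image ends up in either the training or the validation part.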

This is great for experimenting, but where possible you will want to use the official training/validation split for a dataset. Consider the image classification dataset folder structure:

- $dir
    - train
        - class1
            - image1.jpg
            - image2.jpg
            - ...
        - class2
        - ...
    - valid
        - class1
        - class2
        - ...

As you can see, the grandparent folder of each image indicates which split it is a part of. groupobs allows us to partition a data container using a function. Let’s use it to split filedata based on the name of the grandparent directory. (We can’t reuse data for this since it no longer carries the file information.)

trainfiledata, validfiledata = groupobs(filedata) do p
    filename(parent(parent(p)))
end
nobs(trainfiledata), nobs(validfiledata)

(2, 1)

Using this official split makes it easier to compare your results with those of others.
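Grouping by a key function can be sketched with a dictionary that maps each key to the indices of the observations that produced it. This is an illustration of the idea rather than the implementation of groupobs; splitfolder, group_indices, and the example paths are hypothetical.

```julia
# Hypothetical helper: name of the folder two levels above the file.
splitfolder(path::AbstractString) = basename(dirname(dirname(path)))

# Collect the indices of all paths that share the same key `f(path)`.
function group_indices(f, paths)
    groups = Dict{String,Vector{Int}}()
    for (idx, path) in enumerate(paths)
        push!(get!(groups, f(path), Int[]), idx)
    end
    return groups
end

# Example paths following the folder structure shown above.
paths = [
    "/data/train/class1/image1.jpg",
    "/data/train/class2/image2.jpg",
    "/data/valid/class1/image3.jpg",
    "/data/valid/class2/image4.jpg",
]
groups = group_indices(splitfolder, paths)
```

Each group holds indices into the original container, so the subsets can be built as views without copying any files.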
