Data containers
This tutorial explains what data containers are, how they are used in FastAI.jl and how to create your own. You are encouraged to follow along in a REPL or a Jupyter notebook and explore the code. You will find small exercises at the end of some sections to deepen your understanding.
Introduction
In the quickstart section, you have already come across data containers. The following code was used to load a data container for image classification:
using FastAI
using FastAI.Datasets
using FastAI.Datasets: datasetpath, loadtaskdata
NAME = "imagenette2-160"
dir = datasetpath(NAME)
data = loadtaskdata(dir, ImageClassificationTask)
mapobs((input = FastAI.Datasets.loadfile, target = FastAI.Datasets.var"#27#32"()), DataSubset(::FastAI.Datasets.FileDataset, ::Vector{Int64}, ObsDim.Undefined())
13394 observations)
A data container is any type that holds observations of data and allows us to load them with getobs and query the number of observations with nobs:
obs = getobs(data, 1)
(input = ColorTypes.RGB{FixedPointNumbers.N0f8}[RGB{N0f8}(0.451,0.424,0.314) RGB{N0f8}(0.6,0.569,0.478) … RGB{N0f8}(0.773,0.824,0.784) RGB{N0f8}(0.753,0.804,0.765); RGB{N0f8}(0.584,0.557,0.447) RGB{N0f8}(0.745,0.714,0.624) … RGB{N0f8}(0.71,0.761,0.722) RGB{N0f8}(0.678,0.729,0.69); … ; RGB{N0f8}(0.525,0.627,0.624) RGB{N0f8}(0.498,0.612,0.596) … RGB{N0f8}(0.506,0.553,0.459) RGB{N0f8}(0.392,0.439,0.345); RGB{N0f8}(0.396,0.498,0.486) RGB{N0f8}(0.431,0.533,0.522) … RGB{N0f8}(0.361,0.412,0.294) RGB{N0f8}(0.353,0.408,0.278)], target = "n01440764")
nobs(data)
13394
In this case, each observation is a tuple of an image and the corresponding class; after all, we want to use it for image classification.
image, class = obs
@show class
image
class = "n01440764"
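The interface described above is small enough to sketch in plain Julia. The following is a minimal illustration, not FastAI.jl's implementation: SquaresContainer, toy_getobs and toy_nobs are made-up names that stand in for a real container type and for the package's own getobs and nobs, which you would extend instead when writing a real data container.

```julia
# Toy stand-ins for the data container interface. In FastAI.jl you would
# extend the package's own getobs and nobs rather than define new functions.
struct SquaresContainer
    n::Int  # number of observations held by the container
end

# Load the observation at index i: here an (input, target) named tuple.
toy_getobs(data::SquaresContainer, i::Int) = (input = i, target = i^2)

# Query how many observations the container holds.
toy_nobs(data::SquaresContainer) = data.n

squares = SquaresContainer(5)
obs3 = toy_getobs(squares, 3)

# A named tuple can be destructured just like the observation above.
input, target = obs3
```

Nothing is computed until an observation is requested, which is what lets a container describe a dataset that is too large to hold in memory.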
As you saw above, the Datasets submodule provides functions for loading and creating data containers. We used Datasets.datasetpath to download the dataset if it had not been downloaded yet and to get the folder it was downloaded to. Then, Datasets.loadtaskdata took the folder and loaded a data container suitable for image classification. FastAI.jl makes it easy to download the datasets from fastai’s collection on AWS Open Datasets. For the full list, see Datasets.DATASETS.
Exercises
- Have a look at the other image classification datasets in Datasets.DATASETS_IMAGECLASSIFICATION and change the above code to load a different dataset.
Creating data containers from files
loadtaskdata makes it easy to get started when your dataset already comes in the correct format, but alas, datasets come in all shapes and sizes. Let’s create the same data container again, this time using the more general functions FastAI.jl provides, to get a look behind the scenes. If each observation in your dataset is a file in a folder, FileDataset conveniently creates a data container given a path. We’ll use the path of the downloaded dataset:
using FastAI.Datasets: FileDataset
filedata = FileDataset(dir)
FileDataset("/home/runner/.julia/datadeps/fastai-imagenette2-160/imagenette2-160", 13397 observations)
filedata is a data container where each observation is a path to a file. We’ll confirm that using getobs:
p = getobs(filedata, 100)
p"/home/runner/.julia/datadeps/fastai-imagenette2-160/imagenette2-160/train/n01440764/n01440764_10847.JPEG"
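To get a feel for what a file-based data container has to do, here is a rough sketch in plain Julia that collects every file below a folder. It is not how FileDataset is implemented; listfiles is a hypothetical helper, and the tiny folder tree is created in a temporary directory only for illustration.

```julia
# Hypothetical helper: recursively collect every file path below `root`.
function listfiles(root::AbstractString)
    paths = String[]
    for (dirpath, _, files) in walkdir(root)
        for file in files
            push!(paths, joinpath(dirpath, file))
        end
    end
    return sort(paths)  # sort so that each index always maps to the same file
end

# Build a tiny throwaway folder tree to try it on.
root = mktempdir()
mkpath(joinpath(root, "train", "class1"))
touch(joinpath(root, "train", "class1", "image1.jpg"))
touch(joinpath(root, "train", "class1", "image2.jpg"))

paths = listfiles(root)
```

In this sketch, paths[i] plays the role of the i-th observation and length(paths) the role of the number of observations.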
Next we need to load an image and the corresponding class from the path. If you have a look at the folder structure of dir, you can see that the parent folder of each file gives the name of its class. So we can use the following function to load the (image, class) pair from a path:
using FastAI.Datasets: loadfile, filename
function loadimageclass(p)
    return (
        loadfile(p),          # the image stored at path p
        filename(parent(p)),  # the name of the parent folder is the class
    )
end
image, class = loadimageclass(p)
@show class
image
class = "n01440764"
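The same idea of reading the class from the parent folder can be tried without downloading anything, using only Julia's built-in path functions on a plain string. The path below is a shortened, illustrative version of the one shown earlier, and classof is a hypothetical helper name.

```julia
# Illustrative path following the train/<class>/<file> layout.
examplepath = joinpath("imagenette2-160", "train", "n01440764", "n01440764_10847.JPEG")

# dirname strips the file name; basename then keeps the last path component,
# which is the name of the parent folder.
classof(p::AbstractString) = basename(dirname(p))

label = classof(examplepath)
```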
Finally, we use mapobs to lazily transform each observation and have a data container ready to be used for training an image classifier.
data = mapobs(loadimageclass, filedata);
mapobs(loadimageclass, FileDataset("/home/runner/.julia/datadeps/fastai-imagenette2-160/imagenette2-160", 13397 observations))
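The word "lazily" matters here: no image is loaded when the mapped container is created, only when an observation is requested. The sketch below illustrates that behavior with a toy type in plain Julia. LazyMapped, toy_getobs, toy_nobs and shout are made-up names, and this is an assumption about the general technique rather than a description of FastAI.jl's internals.

```julia
# Toy lazy mapping: stores the function and the underlying collection,
# and only applies the function when an observation is requested.
struct LazyMapped{F,D}
    f::F
    data::D
end

toy_getobs(m::LazyMapped, i::Int) = m.f(m.data[i])
toy_nobs(m::LazyMapped) = length(m.data)

calls = Ref(0)  # counts how often the transformation actually runs
function shout(s)
    calls[] += 1
    return uppercase(s)
end

mapped = LazyMapped(shout, ["a", "b", "c"])
callsbefore = calls[]           # creating the container transformed nothing
second = toy_getobs(mapped, 2)  # the function runs for this observation only
callsafter = calls[]
```

Because the transformation is deferred, wrapping a container of 13397 file paths costs almost nothing up front.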
Exercises
- Using mapobs and loadfile, create a data container where every observation is only an image.
Splitting a data container into subsets
Until now, we’ve only created a single data container containing all observations in a dataset. In practice, though, you’ll want to have at least a training and a validation split. The easiest way to get these is to randomly split your data container into two parts. Here we split data into 80% training and 20% validation data. Note the use of shuffleobs to make sure each split has approximately the same class distribution. Without shuffling, the split would simply cut the observations in their original order.
traindata, valdata = splitobs(shuffleobs(data), at = 0.8);
(DataSubset(::FastAI.Datasets.MappedData, view(::Vector{Int64}, 1:10718), ObsDim.Undefined())
10718 observations, DataSubset(::FastAI.Datasets.MappedData, view(::Vector{Int64}, 10719:13397), ObsDim.Undefined())
2679 observations)
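Conceptually, a random split only needs to shuffle the observation indices and cut them at the requested fraction. The following sketch shows this with plain Julia. shufflesplit is a hypothetical helper, and rounding the cut point to the nearest integer is an assumption, chosen because it reproduces the 10718 and 2679 observations shown above.

```julia
using Random

# Hypothetical helper: shuffle the indices 1:n, then cut them at fraction `at`.
function shufflesplit(rng::AbstractRNG, n::Int; at::Float64 = 0.8)
    indices = shuffle(rng, 1:n)
    ntrain = round(Int, at * n)  # the rounding rule is an assumption
    return indices[1:ntrain], indices[ntrain+1:end]
end

# A seeded generator makes the split reproducible between runs.
trainidx, validx = shufflesplit(MersenneTwister(42), 13397)
```

Every index ends up in exactly one of the two parts, so no observation is shared between training and validation.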
This is great for experimenting, but where possible you will want to use the official training/validation split for a dataset. Consider the image classification dataset folder structure:
- $dir
  - train
    - class1
      - image1.jpg
      - image2.jpg
      - ...
    - class2
    - ...
  - valid
    - class1
    - class2
    - ...
As you can see, the grandparent folder of each image indicates which split it is a part of. groupobs allows us to partition a data container using a function. Let’s use it to split filedata based on the name of the grandparent directory. (We can’t reuse data for this since it no longer carries the file information.)
trainfiledata, validfiledata = groupobs(filedata) do p
    filename(parent(parent(p)))
end
nobs(trainfiledata), nobs(validfiledata)
(2, 1)
Using this official split makes it easier to compare your results with those of others.
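Partitioning by a function can be sketched in a few lines of plain Julia: compute a key for every item and collect the items that share a key. This is only an illustration of the idea, not of how groupobs works internally. groupby and splitof are hypothetical helpers, and the paths are made up to follow the folder layout shown above.

```julia
# Hypothetical helper: group items by the key that `keyfn` returns for each.
function groupby(keyfn, items)
    groups = Dict{String,Vector{eltype(items)}}()
    for item in items
        # get! creates an empty group the first time a key is seen.
        push!(get!(groups, keyfn(item), eltype(items)[]), item)
    end
    return groups
end

# Made-up paths following the folder layout shown above.
examplepaths = [
    joinpath("data", "train", "class1", "image1.jpg"),
    joinpath("data", "train", "class2", "image2.jpg"),
    joinpath("data", "valid", "class1", "image3.jpg"),
]

# The grandparent folder of each file names its split.
splitof(p) = basename(dirname(dirname(p)))

groups = groupby(splitof, examplepaths)
```

Looking groups up by name, as in groups["train"] and groups["valid"], avoids relying on the order in which the groups happen to be returned.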