Introduction to data containers, giving an overview of the kinds of datasets you can use
DataLoaders.jl is built to integrate with the wider ecosystem and builds on a common interface to support datasets. We call such a dataset a data container, and it needs to support the following operations:
getobs(data, i) loads the i-th observation from a dataset.
nobs(data) gives the number of observations in a dataset.
The simplest data container is a vector of values:
    using DataLoaders
    @show v = rand(1:10, 10)
    @show nobs(v)
    getobs(v, 1)

v = rand(1:10, 10) = [5, 1, 4, 8, 8, 8, 1, 9, 5, 10]
nobs(v) = 10
5
Multi-dimensional arrays also work, with the last dimension treated as the observation dimension:
    a = rand(50, 50, 10)
    summary(getobs(a, 1))

50×50 Matrix{Float64}
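To make the "last dimension is the observation dimension" convention concrete, here is a minimal sketch using only Base Julia. The helper `last_dim_obs` is a hypothetical name introduced for illustration and is not part of DataLoaders.jl. Note that `selectdim` returns a view, whereas the output above shows `getobs` returning a `Matrix`.

```julia
# Conceptual sketch using only Base; not the package's implementation.
a = rand(50, 50, 10)

# Hypothetical helper: select the i-th slice along the last dimension.
last_dim_obs(x::AbstractArray, i::Integer) = selectdim(x, ndims(x), i)

obs = last_dim_obs(a, 1)

@show size(obs)           # size of a single observation: (50, 50)
@show obs == a[:, :, 1]   # same values as indexing the last dimension directly
@show size(a, ndims(a))   # number of observations along the last dimension: 10
```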
You can also group multiple data containers with the same length together by putting them into a Tuple:
    data = (v, a)
    getobs(data, 1)

(5, [0.7150340753032409 0.49766401817628114 … 0.5613436035579596 0.8755339578394489; 0.6332234214178506 0.9129848169089206 … 0.16918290247514878 0.46201538232216766; … ; 0.002108134117038696 0.34241741184069774 … 0.751484293460552 0.2960292690651848; 0.8939841690468608 0.5198491067971963 … 0.7571386133614033 0.6933712044947087])
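The tuple behaviour shown above can be pictured as applying the two interface operations to each member in turn. The following is a sketch under that assumption; `sketch_nobs` and `sketch_getobs` are hypothetical stand-ins defined here so that the snippet runs without any packages, and the length check is an assumption of this sketch rather than documented package behaviour.

```julia
# Stand-in functions (hypothetical names); not the package's implementation.
sketch_nobs(x::AbstractArray) = size(x, ndims(x))
sketch_getobs(x::AbstractVector, i::Integer) = x[i]
sketch_getobs(x::AbstractArray, i::Integer) = collect(selectdim(x, ndims(x), i))

# A tuple has as many observations as its members, which must all agree.
function sketch_nobs(data::Tuple)
    n = sketch_nobs(first(data))
    all(d -> sketch_nobs(d) == n, data) ||
        throw(DimensionMismatch("all containers in a tuple need the same length"))
    return n
end

# The i-th observation of a tuple is the tuple of each member's i-th observation.
sketch_getobs(data::Tuple, i::Integer) = map(d -> sketch_getobs(d, i), data)

v = collect(1:10)
a = rand(50, 50, 10)
x, y = sketch_getobs((v, a), 1)

@show x               # first element of v
@show size(y)         # first 50×50 slice of a
@show sketch_nobs((v, a))
```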
You can pass any data container to DataLoader to create an iterator over batches:
    for batch in DataLoader(v, 2)
        @assert size(batch) == (2,)
    end

    for batch in DataLoader(a, 2)
        @assert size(batch) == (50, 50, 2)
    end

    for (vs, as) in DataLoader((v, a), 2)
        @assert size(vs) == (2,)
        @assert size(as) == (50, 50, 2)
    end
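The batch shapes asserted above are what you get by taking consecutive chunks of observation indices along the observation dimension. The sketch below reproduces those shapes with Base Julia only. The helpers `batch_indices` and `take_batch` are hypothetical names, and this is a conceptual illustration rather than how DataLoaders.jl works internally. In particular, the sketch keeps a smaller final chunk when the batch size does not divide the number of observations, which says nothing about what the package does in that case.

```julia
# Conceptual sketch only; hypothetical helper names.
v = collect(1:10)
a = rand(50, 50, 10)

# Split the observation indices 1:n into consecutive chunks of `batchsize`.
batch_indices(n::Integer, batchsize::Integer) =
    collect(Iterators.partition(1:n, batchsize))

# Gather the observations of one chunk along the last (observation) dimension.
take_batch(x::AbstractArray, idxs) = collect(selectdim(x, ndims(x), idxs))

for idxs in batch_indices(length(v), 2)
    @assert size(take_batch(v, idxs)) == (2,)
    @assert size(take_batch(a, idxs)) == (50, 50, 2)
end
```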
Arrays, of course, are kept in memory, so we (1) cannot use them to store larger-than-memory datasets and (2) don’t need to use multithreading, since loading an observation just involves indexing an array, which is generally fast.
One way to quickly get into too-large-to-fit-in-memory territory is to work with image datasets. So instead of loading every image of a dataset into an array, we’ll implement a data container that stores only the file name of each image. It will load the image itself only when getobs is called. To do that, we’ll implement a struct that stores a vector of file names, and implement getobs and nobs for that type.
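    import DataLoaders.LearnBase: getobs, nobs
    using Images

    struct ImageDataset
        files::Vector{String}
    end

    # `join = true` makes readdir return full paths, so the files can be
    # loaded regardless of the current working directory.
    ImageDataset(folder::String) = ImageDataset(readdir(folder; join = true))

    # Number of observations: one per file.
    nobs(data::ImageDataset) = length(data.files)
    # Load the i-th image from disk only when it is requested.
    getobs(data::ImageDataset, i::Int) = Images.load(data.files[i])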
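To try the same lazy-loading pattern without an image folder or the Images package, here is a self-contained sketch that uses small text files in a temporary directory. Everything in it is illustrative: `TextFileDataset` is a hypothetical type analogous to `ImageDataset`, and `sketch_nobs` / `sketch_getobs` are local stand-ins. In real code you would extend the imported `getobs` and `nobs` instead.

```julia
# Hypothetical container: stores file paths, reads a file only on request.
struct TextFileDataset
    files::Vector{String}
end

# `join = true` makes readdir return full paths instead of bare file names.
TextFileDataset(folder::String) = TextFileDataset(readdir(folder; join = true))

# Local stand-ins for the interface functions (not the package's functions).
sketch_nobs(data::TextFileDataset) = length(data.files)
sketch_getobs(data::TextFileDataset, i::Int) = read(data.files[i], String)

# Create three small files to act as a dataset on disk.
dir = mktempdir()
for i in 1:3
    write(joinpath(dir, "obs$i.txt"), "observation $i")
end

data = TextFileDataset(dir)
@show sketch_nobs(data)
@show sketch_getobs(data, 2)
```

Constructing `data` only lists the directory; file contents are read when `sketch_getobs` is called.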
Now, if we have a folder full of images, we can create a data container and load them quickly into batches as follows:
    data = ImageDataset("path/to/my/images")

    for images in DataLoader(data, 16, collate = false)
        # Do something
    end
Above we pass the collate = false argument because images may be of different sizes that cannot be collated. See collate. In practice, it is common to apply some cropping and resizing to images so that they all have the same size.
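As an illustration of why equal sizes matter, the following sketch crops differently sized arrays to a common size and then stacks them into one batch array. It uses plain `rand` matrices as stand-ins for loaded images, and `center_crop` is a hypothetical helper written for this example, not a function from DataLoaders.jl or Images.

```julia
# Hypothetical helper: take the central h×w window of a 2-D array.
function center_crop(img::AbstractMatrix, h::Integer, w::Integer)
    H, W = size(img)
    (h <= H && w <= W) || throw(ArgumentError("crop size exceeds image size"))
    top = (H - h) ÷ 2 + 1
    left = (W - w) ÷ 2 + 1
    return img[top:top+h-1, left:left+w-1]
end

# Stand-ins for images of different sizes.
images = [rand(60, 80), rand(50, 50), rand(72, 64)]

# After cropping, every array has the same size and can be stacked
# along a new last (observation) dimension.
cropped = [center_crop(img, 48, 48) for img in images]
batch = cat(cropped...; dims = 3)

@show size(batch)   # (48, 48, 3)
```

Applying a step like this inside `getobs` would give every observation the same size.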
To use DataLoaders’ multi-threading, you need to start Julia with multiple threads. Check the number of available threads with Threads.nthreads().
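A small check like the following can be run at the start of a session. The thread count of 4 in the comments is only an example value.

```julia
# Start Julia with several threads, for example:
#   julia --threads 4            (command-line flag, Julia 1.5 or newer)
#   JULIA_NUM_THREADS=4 julia    (environment variable)
n = Threads.nthreads()

if n == 1
    @warn "Julia is running with a single thread; restart it with `julia --threads N`."
else
    @info "Julia is running with $n threads."
end
```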