Data containers

An introduction to data containers, with an overview of the kinds of datasets you can use.

DataLoaders.jl is built to integrate with the wider ecosystem and builds on a common interface to support datasets. We call such a dataset a data container, and it needs to support the following operations:

  • getobs(data, i) loads the i-th observation from a dataset.

  • nobs(data) gives the number of observations in a dataset.
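As a rough illustration of this interface, the sketch below defines a tiny lazy container. The `Squares` type is hypothetical and not part of DataLoaders.jl; the import line mirrors the one used for the image dataset later on this page and is assumed to be valid for your installed version.

```julia
# Sketch only: `Squares` is a made-up container used for illustration.
import DataLoaders.LearnBase: getobs, nobs

# A lazy dataset of the first `n` square numbers; nothing is stored up front.
struct Squares
    n::Int
end

# Number of observations in the container.
nobs(data::Squares) = data.n

# Compute the i-th observation on demand.
getobs(data::Squares, i::Int) = i^2

data = Squares(5)
nobs(data)       # 5
getobs(data, 3)  # 9
```

Any type that implements these two methods should be usable wherever a data container is expected.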

Basic data containers

The simplest data container is a vector of values:

    using DataLoaders
    @show v = rand(1:10, 10)
    @show nobs(v)
    getobs(v, 1)

    v = rand(1:10, 10) = [5, 1, 4, 8, 8, 8, 1, 9, 5, 10]
    nobs(v) = 10
    5

Multi-dimensional arrays also work, with the last dimension treated as the observation dimension:

    a = rand(50, 50, 10)
    summary(getobs(a, 1))

    50×50 Matrix{Float64}

You can also group multiple data containers of the same length by putting them into a Tuple:

    data = (v, a)
    getobs(data, 1)

    (5, [0.7150340753032409 0.49766401817628114 … 0.5613436035579596 0.8755339578394489; 0.6332234214178506 0.9129848169089206 … 0.16918290247514878 0.46201538232216766; … ; 0.002108134117038696 0.34241741184069774 … 0.751484293460552 0.2960292690651848; 0.8939841690468608 0.5198491067971963 … 0.7571386133614033 0.6933712044947087])

You can pass any data container to DataLoader to create an iterator over batches:

    for batch in DataLoader(v, 2)
        @assert size(batch) == (2,)
    end

    for batch in DataLoader(a, 2)
        @assert size(batch) == (50, 50, 2)
    end

    for (vs, as) in DataLoader((v, a), 2)
        @assert size(vs) == (2,)
        @assert size(as) == (50, 50, 2)
    end

Out-of-memory data containers

Arrays, of course, are kept in memory, so we (1) cannot use them to store larger-than-memory datasets and (2) don’t need multithreading, since loading an observation just involves indexing an array, which is generally fast.

Image datasets are one way to quickly end up with more data than fits in memory. So instead of loading every image of a dataset into an array, we’ll implement a data container that stores only the file path of each image and loads the image itself only when getobs is called. To do that, we’ll implement a struct that stores a vector of file paths, and implement getobs and nobs for that type.

    import DataLoaders.LearnBase: getobs, nobs
    using Images

    struct ImageDataset
        files::Vector{String}
    end

    # `join = true` returns full paths, so loading works from any working directory
    ImageDataset(folder::String) = ImageDataset(readdir(folder; join = true))

    # Number of observations: one per file
    nobs(data::ImageDataset) = length(data.files)

    # Load the i-th image from disk only when it is requested
    getobs(data::ImageDataset, i::Int) = Images.load(data.files[i])

Now, if we have a folder full of images, we can create a data container and load them quickly into batches as follows:

    data = ImageDataset("path/to/my/images")

    for images in DataLoader(data, 16, collate = false)
        # Do something
    end

Preprocessing

Above, we pass the collate = false argument because images may have different sizes, in which case they cannot be collated. See collate. In practice, it is common to crop and resize images so that they all have the same size.
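One possible way to apply such preprocessing is to wrap a data container in another container that transforms each observation as it is loaded. The sketch below is an assumption-laden illustration: `TransformedData` is a hypothetical type, not something DataLoaders.jl is known to provide, and plain matrices of different sizes stand in for images.

```julia
import DataLoaders.LearnBase: getobs, nobs

# Hypothetical wrapper: applies `f` to every observation of `data` when it is loaded.
struct TransformedData{F,D}
    f::F
    data::D
end

# The wrapper has as many observations as the wrapped container.
nobs(t::TransformedData) = nobs(t.data)

# Load the i-th observation from the wrapped container, then transform it.
getobs(t::TransformedData, i::Int) = t.f(getobs(t.data, i))

# Stand-in for images of varying size: a vector of differently sized matrices.
raw = [rand(4, 6), rand(5, 5), rand(6, 4)]

# Crop every observation to its top-left 4×4 corner so all sizes match.
cropped = TransformedData(x -> x[1:4, 1:4], raw)

size(getobs(cropped, 1))  # (4, 4)
size(getobs(cropped, 3))  # (4, 4)
```

With real images, `f` could instead be a resizing function such as `imresize` from Images.jl. Once every observation has the same size, passing collate = false should no longer be necessary.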

Threads

To use DataLoaders’ multi-threading, you need to start Julia with multiple threads. Check the number of available threads with Threads.nthreads().
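A small check along these lines can be run at the top of a script. The start-up options mentioned in the comments are standard Julia options rather than anything specific to DataLoaders.jl.

```julia
# Start Julia with, for example, `julia --threads 4` (Julia 1.5 or later),
# or set the environment variable JULIA_NUM_THREADS=4 before launching.

n = Threads.nthreads()
if n == 1
    @warn "Julia is running with a single thread."
else
    @info "Julia is running with $n threads."
end
```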
