composer.datasets.imagenet
ImageNet classification dataset.
The most widely used dataset for image classification algorithms. Please refer to the ImageNet 2012 Classification Dataset for more details. This module also includes streaming versions of the dataset based on WebDataset.
Hparams
These classes are used with yahp for YAML-based configuration.
Imagenet1kWebDatasetHparams | Defines an instance of the ImageNet-1k WebDataset for image classification.
ImagenetDatasetHparams | Defines an instance of the ImageNet dataset for image classification.
TinyImagenet200WebDatasetHparams | Defines an instance of the TinyImagenet-200 WebDataset for image classification.
- class composer.datasets.imagenet.Imagenet1kWebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-imagenet1k', name='imagenet1k', resize_size=-1, crop_size=224)
Bases: composer.datasets.hparams.WebDatasetHparams
Defines an instance of the ImageNet-1k WebDataset for image classification.
- Parameters
datadir (str): The path to the data directory.
is_train (bool): Whether to load the training data or validation data. Default: True.
drop_last (bool): If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool): Whether to shuffle the dataset. Default: True.
webdataset_cache_dir (str): WebDataset cache directory. Default: '/tmp/webdataset_cache/'.
webdataset_cache_verbose (bool): WebDataset cache verbosity. Default: False.
remote (str): S3 bucket or root directory where the dataset is stored. Default: 's3://mosaicml-internal-dataset-imagenet1k'.
name (str): Key used to determine where the dataset is cached on the local filesystem. Default: 'imagenet1k'.
resize_size (int, optional): The resize size to use. Use -1 to not resize. Default: -1.
crop_size (int): The crop size to use. Default: 224.
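The interaction between resize_size and crop_size can be illustrated with a small sketch. It shows only the documented sentinel behavior (-1 means "do not resize") and assumes that a crop is always applied after any resize. The helper plan_transforms is hypothetical and is not part of Composer; the actual transforms the library applies may differ.

```python
from typing import List, Tuple


def plan_transforms(resize_size: int = -1, crop_size: int = 224) -> List[Tuple[str, int]]:
    """Return an ordered list of (operation, size) steps.

    Hypothetical helper for illustration only; not a Composer API.
    """
    steps: List[Tuple[str, int]] = []
    # -1 is the documented "do not resize" sentinel, so only positive
    # values add a resize step.
    if resize_size > 0:
        steps.append(("resize", resize_size))
    # Assumption: the crop is applied after the optional resize.
    steps.append(("crop", crop_size))
    return steps


default_plan = plan_transforms()                 # defaults from the signature
resized_plan = plan_transforms(resize_size=256)  # resize first, then crop
print(default_plan)
print(resized_plan)
```

With the defaults from the class signature, the plan contains only the crop step.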
- initialize_object(batch_size, dataloader_hparams)
Creates a DataLoader or DataSpec for this dataset.
- Parameters
batch_size (int): The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.
dataloader_hparams (DataLoaderHparams): The dataset-independent hparams for the dataloader.
- Returns
DataLoader or DataSpec: The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.
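The note that batch_size is device-specific and already incorporates the world size can be read as follows: the caller divides the global batch size across devices before calling initialize_object. The sketch below shows that arithmetic under the assumption that the global batch is split evenly. The helper per_device_batch_size is hypothetical and is not part of Composer.

```python
def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch size evenly across devices.

    Hypothetical helper for illustration only; not a Composer API.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if global_batch_size % world_size != 0:
        # Assumption: uneven splits are rejected rather than rounded.
        raise ValueError("global_batch_size must be divisible by world_size")
    return global_batch_size // world_size


# A global batch of 2048 samples across 8 devices gives 256 per device;
# that per-device value is what would be passed as batch_size.
device_batch = per_device_batch_size(2048, 8)
print(device_batch)
```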
- class composer.datasets.imagenet.ImagenetDatasetHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=None, resize_size=-1, crop_size=224, use_ffcv=False, ffcv_dir='/tmp', ffcv_dest_train='train.ffcv', ffcv_dest_val='val.ffcv', ffcv_write_dataset=False)
Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin
Defines an instance of the ImageNet dataset for image classification.
- Parameters
use_synthetic (bool, optional): Whether to use synthetic data. Default: False.
synthetic_num_unique_samples (int, optional): The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.
synthetic_device (str, optional): The device to store the sample pool on. Set to 'cuda' to store samples on the GPU and eliminate PCI-e bandwidth usage by the dataloader. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.
synthetic_memory_format: The MemoryFormat to use. Ignored if use_synthetic is False. Default: 'CONTIGUOUS_FORMAT'.
datadir (str): The path to the data directory.
is_train (bool): Whether to load the training data or validation data. Default: True.
drop_last (bool): If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool): Whether to shuffle the dataset. Default: True.
resize_size (int, optional): The resize size to use. Use -1 to not resize. Default: -1.
crop_size (int): The crop size to use. Default: 224.
use_ffcv (bool): Whether to use FFCV dataloaders. Default: False.
ffcv_dir (str): A directory containing train/val <file>.ffcv files. If these files don't exist and ffcv_write_dataset is True, train/val <file>.ffcv files will be created in this directory. Default: "/tmp".
ffcv_dest_train (str): <file>.ffcv file that has training samples. Default: "train.ffcv".
ffcv_dest_val (str): <file>.ffcv file that has validation samples. Default: "val.ffcv".
ffcv_write_dataset (bool): Whether to create the dataset in FFCV format (<file>.ffcv) if it doesn't exist. Default: False.
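The FFCV parameters describe a simple rule: the file lives at ffcv_dir joined with ffcv_dest_train or ffcv_dest_val, and it is created only when it is missing and ffcv_write_dataset is True. A minimal sketch of that decision is given here. The helper resolve_ffcv_file is hypothetical and is not part of Composer or FFCV, and raising FileNotFoundError when the file is missing and writing is disabled is an assumption, since the documentation does not say what happens in that case.

```python
import os
import tempfile
from typing import Tuple


def resolve_ffcv_file(ffcv_dir: str, ffcv_dest: str, ffcv_write_dataset: bool) -> Tuple[str, bool]:
    """Return (path, needs_write) for an FFCV file.

    Hypothetical helper for illustration only; not a Composer or FFCV API.
    """
    path = os.path.join(ffcv_dir, ffcv_dest)
    if os.path.exists(path):
        return path, False  # file already present, nothing to write
    if ffcv_write_dataset:
        return path, True   # file missing, caller should create it
    # Assumption: a missing file with writing disabled is treated as an error.
    raise FileNotFoundError(f"{path} not found and ffcv_write_dataset is False")


with tempfile.TemporaryDirectory() as tmp_dir:
    # Missing training file, writing enabled: it should be created.
    train_path, train_needs_write = resolve_ffcv_file(tmp_dir, "train.ffcv", True)
    # Existing validation file: it is reused even with writing disabled.
    open(os.path.join(tmp_dir, "val.ffcv"), "wb").close()
    val_path, val_needs_write = resolve_ffcv_file(tmp_dir, "val.ffcv", False)

print(train_needs_write, val_needs_write)
```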
- initialize_object(batch_size, dataloader_hparams)
Creates a DataLoader or DataSpec for this dataset.
- Parameters
batch_size (int): The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.
dataloader_hparams (DataLoaderHparams): The dataset-independent hparams for the dataloader.
- Returns
DataLoader or DataSpec: The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.
- class composer.datasets.imagenet.TinyImagenet200WebDatasetHparams(is_train=True, drop_last=True, shuffle=True, datadir=None, webdataset_cache_dir='/tmp/webdataset_cache/', webdataset_cache_verbose=False, shuffle_buffer=256, remote='s3://mosaicml-internal-dataset-tinyimagenet200', name='tinyimagenet200', n_train_samples=100000, n_val_samples=10000, height=64, width=64, n_classes=200, channel_means=(0.485, 0.456, 0.406), channel_stds=(0.229, 0.224, 0.225))
Bases: composer.datasets.hparams.WebDatasetHparams
Defines an instance of the TinyImagenet-200 WebDataset for image classification.
- Parameters
datadir (str): The path to the data directory.
is_train (bool): Whether to load the training data or validation data. Default: True.
drop_last (bool): If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
shuffle (bool): Whether to shuffle the dataset. Default: True.
webdataset_cache_dir (str): WebDataset cache directory. Default: '/tmp/webdataset_cache/'.
webdataset_cache_verbose (bool): WebDataset cache verbosity. Default: False.
remote (str): S3 bucket or root directory where the dataset is stored. Default: 's3://mosaicml-internal-dataset-tinyimagenet200'.
name (str): Key used to determine where the dataset is cached on the local filesystem. Default: 'tinyimagenet200'.
n_train_samples (int): Number of training samples. Default: 100000.
n_val_samples (int): Number of validation samples. Default: 10000.
height (int): Sample image height in pixels. Default: 64.
width (int): Sample image width in pixels. Default: 64.
n_classes (int): Number of output classes. Default: 200.
channel_means (list of float): Channel means for normalization. Default: (0.485, 0.456, 0.406).
channel_stds (list of float): Channel stds for normalization. Default: (0.229, 0.224, 0.225).
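The channel_means and channel_stds parameters describe standard per-channel normalization, (value - mean) / std. The sketch below applies it to a single RGB pixel using the defaults listed above. It assumes pixel values are already scaled to the range [0, 1], which the documentation does not state, and normalize_pixel is a hypothetical helper rather than a Composer function.

```python
from typing import Sequence, Tuple

# Defaults taken from the class signature.
CHANNEL_MEANS = (0.485, 0.456, 0.406)
CHANNEL_STDS = (0.229, 0.224, 0.225)


def normalize_pixel(
    rgb: Sequence[float],
    means: Sequence[float] = CHANNEL_MEANS,
    stds: Sequence[float] = CHANNEL_STDS,
) -> Tuple[float, ...]:
    """Normalize one RGB pixel channel by channel.

    Hypothetical helper for illustration only; assumes values in [0, 1].
    """
    return tuple((value - mean) / std for value, mean, std in zip(rgb, means, stds))


# A pixel equal to the channel means maps to zero in every channel.
mean_pixel = normalize_pixel(CHANNEL_MEANS)
# A pure white pixel maps to a positive value in every channel.
white_pixel = normalize_pixel((1.0, 1.0, 1.0))
print(mean_pixel)
print(white_pixel)
```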
- initialize_object(batch_size, dataloader_hparams)
Creates a DataLoader or DataSpec for this dataset.
- Parameters
batch_size (int): The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.
dataloader_hparams (DataLoaderHparams): The dataset-independent hparams for the dataloader.
- Returns
DataLoader or DataSpec: The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.
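To make the effect of drop_last on the yielded batches concrete, the number of batches per epoch can be worked out for the default TinyImagenet-200 sample counts (n_train_samples=100000, n_val_samples=10000). This sketch assumes a single device and a batch size of 256, which is an arbitrary choice for illustration. The helper batches_per_epoch is hypothetical and is not part of Composer.

```python
import math


def batches_per_epoch(n_samples: int, batch_size: int, drop_last: bool) -> int:
    """Count the batches yielded in one pass over n_samples.

    Hypothetical helper for illustration only; assumes a single device.
    """
    if drop_last:
        # The final partial batch is discarded.
        return n_samples // batch_size
    # The final partial batch is kept (padded, per the parameter description).
    return math.ceil(n_samples / batch_size)


train_batches = batches_per_epoch(100000, 256, drop_last=True)
val_batches_dropped = batches_per_epoch(10000, 256, drop_last=True)
val_batches_kept = batches_per_epoch(10000, 256, drop_last=False)
print(train_batches, val_batches_dropped, val_batches_kept)
```

With 10000 validation samples and a batch size of 256, 16 samples fall outside the 39 full batches, so drop_last=True skips them while drop_last=False yields a 40th batch.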