composer.datasets.lm_datasets#
Generic dataset class for self-supervised training of autoregressive and masked language models.
Hparams
These classes are used with yahp for YAML-based configuration.
- class composer.datasets.lm_datasets.LMDatasetHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=<factory>, split=None, tokenizer_name=None, use_masked_lm=None, num_tokens=0, mlm_probability=0.15, seed=5, subsample_ratio=1.0, max_seq_length=1024)[source]#
Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin

Defines a generic dataset class for self-supervised training of autoregressive and masked language models.
- Parameters
  use_synthetic (bool, optional) – Whether to use synthetic data. Default: False.
  synthetic_num_unique_samples (int, optional) – The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.
  synthetic_device (str, optional) – The device to store the sample pool on. Set to 'cuda' to store samples on the GPU and eliminate PCI-e bandwidth with the dataloader. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.
  synthetic_memory_format – The MemoryFormat to use. Ignored if use_synthetic is False. Default: 'CONTIGUOUS_FORMAT'.
  is_train (bool) – Whether to load the training data or the validation data. Default: True.
  drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.
  shuffle (bool) – Whether to shuffle the dataset. Default: True.
  datadir (list) – List containing the path to the HuggingFace Datasets directory.
  split (str) – Whether to use the 'train', 'test', or 'validation' split.
  tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See the HuggingFace documentation.
  use_masked_lm (bool) – Whether the dataset should be encoded with masked language modeling or not.
  num_tokens (int, optional) – Number of tokens to train on. 0 will train on all tokens in the dataset. Default: 0.
  mlm_probability (float, optional) – If using masked language modeling, the probability with which tokens will be masked. Default: 0.15.
  seed (int, optional) – Random seed for generating train and validation splits. Default: 5.
  subsample_ratio (float, optional) – Proportion of the dataset to use. Default: 1.0.
  max_seq_length (int, optional) – Maximum sequence length of the tokenized samples. Default: 1024.
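Because LMDatasetHparams is a yahp dataclass, its fields can be populated from a YAML configuration file. The fragment below is a hypothetical sketch: the keys mirror the constructor arguments above, but the values (tokenizer name, dataset path) are illustrative and not taken from the source.

```yaml
# Hypothetical yahp YAML for LMDatasetHparams.
# Keys correspond to the constructor arguments; values are examples only.
datadir:
  - /path/to/huggingface/dataset
split: train
tokenizer_name: gpt2
use_masked_lm: false
num_tokens: 0
mlm_probability: 0.15
seed: 5
subsample_ratio: 1.0
max_seq_length: 1024
is_train: true
drop_last: true
shuffle: true
use_synthetic: false
```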
- initialize_object(batch_size, dataloader_hparams)[source]#

  Creates a DataLoader or DataSpec for this dataset.

  - Parameters
    batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.
    dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.
  - Returns
    DataLoader or DataSpec – The DataLoader, or if the dataloader yields batches of custom types, a DataSpec.
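For intuition on what mlm_probability controls: in masked language modeling, roughly that fraction of input tokens is replaced by a mask token, and the model is trained to predict the originals at those positions. The following is a minimal stdlib-only sketch of that idea, not Composer's actual implementation (which delegates masking to HuggingFace data collators, including their more elaborate replacement scheme).

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder mask symbol; real tokenizers define their own

def mask_tokens(tokens, mlm_probability=0.15, seed=5):
    """Replace each token with MASK_TOKEN with probability mlm_probability.

    Returns the masked sequence and the labels: the original token at each
    masked position, None elsewhere (positions that are not scored).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # model must recover the original token here
        else:
            masked.append(tok)
            labels.append(None)  # position excluded from the loss
    return masked, labels

tokens = ["the", "quick", "brown", "fox", "jumps"] * 200
masked, labels = mask_tokens(tokens, mlm_probability=0.15)
rate = masked.count(MASK_TOKEN) / len(masked)
print(f"observed mask rate: {rate:.3f}")  # approaches 0.15 for long inputs
```

Over long sequences the observed mask rate concentrates near mlm_probability, which is why 0.15 directly sets the share of tokens the model learns to reconstruct.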