GLUE (General Language Understanding Evaluation) dataset (Wang et al., 2019).

The GLUE benchmark datasets consist of nine sentence- or sentence-pair language understanding tasks designed to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.

Note that the GLUE diagnostic dataset, which is designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, is not included here.

Please refer to the GLUE benchmark for more details.


These classes are used with yahp for YAML-based configuration.


Sets up a generic GLUE dataset loader.

class composer.datasets.glue.GLUEHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, is_train=True, drop_last=True, shuffle=True, datadir=None, task=None, tokenizer_name=None, split=None, max_seq_length=256, max_network_retries=10)

Bases: composer.datasets.hparams.DatasetHparams, composer.datasets.hparams.SyntheticHparamsMixin

Sets up a generic GLUE dataset loader.

  • use_synthetic (bool, optional) – Whether to use synthetic data. Default: False.

  • synthetic_num_unique_samples (int, optional) – The number of unique samples to allocate memory for. Ignored if use_synthetic is False. Default: 100.

  • synthetic_device (str, optional) – The device to store the sample pool on. Set to 'cuda' to store samples on the GPU, so the dataloader uses no PCI-e bandwidth. Set to 'cpu' to move data between host memory and the device on every batch. Ignored if use_synthetic is False. Default: 'cpu'.

  • synthetic_memory_format (MemoryFormat, optional) – The MemoryFormat to use. Ignored if use_synthetic is False. Default: MemoryFormat.CONTIGUOUS_FORMAT.

  • datadir (str) – The path to the data directory.

  • is_train (bool) – Whether to load the training data or validation data. Default: True.

  • drop_last (bool) – If the number of samples is not divisible by the batch size, whether to drop the last batch or pad the last batch with zeros. Default: True.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • task (str) – The GLUE task to train on. One of 'CoLA', 'MNLI', 'MRPC', 'QNLI', 'QQP', 'RTE', 'SST-2', or 'STS-B'.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See HuggingFace documentation.

  • split (str) – Which split to use: 'train', 'validation', or 'test'.

  • max_seq_length (int, optional) – A custom sequence length for the training dataset. Default: 256.

  • max_network_retries (int, optional) – Number of times to retry HTTP requests if they fail. Default: 10.


Returns: DataLoader – A PyTorch DataLoader object.
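
As a rough illustration of how the fields above fit together, the sketch below collects the defaults from the GLUEHparams signature into a plain dictionary and checks the task and split values against the documented choices. It does not import composer or yahp. The names GLUE_TASKS, GLUE_SPLITS, DEFAULTS, and check_glue_config are hypothetical, and the tokenizer name is only an example value.

```python
# Hypothetical helper for illustration; not part of composer or yahp.
GLUE_TASKS = ("CoLA", "MNLI", "MRPC", "QNLI", "QQP", "RTE", "SST-2", "STS-B")
GLUE_SPLITS = ("train", "validation", "test")

# Defaults copied from the documented GLUEHparams signature
# (the synthetic_* fields are omitted for brevity).
DEFAULTS = {
    "use_synthetic": False,
    "is_train": True,
    "drop_last": True,
    "shuffle": True,
    "datadir": None,
    "task": None,
    "tokenizer_name": None,
    "split": None,
    "max_seq_length": 256,
    "max_network_retries": 10,
}


def check_glue_config(overrides):
    """Merge overrides into the defaults and check the task and split values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    if config["task"] not in GLUE_TASKS:
        raise ValueError(f"task must be one of {GLUE_TASKS}, got {config['task']!r}")
    if config["split"] not in GLUE_SPLITS:
        raise ValueError(f"split must be one of {GLUE_SPLITS}, got {config['split']!r}")
    if config["max_seq_length"] <= 0:
        raise ValueError("max_seq_length must be positive")
    return config


# Example: an evaluation configuration for the RTE task.
config = check_glue_config({
    "task": "RTE",
    "split": "validation",
    "tokenizer_name": "bert-base-uncased",  # example value only
    "is_train": False,
})
```

Whether the real class performs the same checks, or accepts other spellings of the task names, is not stated here; the sketch only mirrors the documented choices.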

initialize_object(batch_size, dataloader_hparams)

Creates a DataLoader or DataSpec for this dataset.

  • batch_size (int) – The size of the batch the dataloader should yield. This batch size is device-specific and already incorporates the world size.

  • dataloader_hparams (DataLoaderHparams) – The dataset-independent hparams for the dataloader.

Returns: Iterable | DataSpec – An iterable that yields batches or, if the dataset yields batches that need custom processing, a DataSpec.
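
The arithmetic behind a device-specific batch size can be sketched as follows. This assumes that "already incorporates the world size" means the global batch size has been divided evenly across devices, which is one reading of the description and not something confirmed here. The helper names per_device_batch_size and num_batches are hypothetical. The second helper shows how a drop_last flag changes the number of batches when the sample count is not divisible by the batch size.

```python
import math


def per_device_batch_size(global_batch_size, world_size):
    """Split a global batch size evenly across devices (assumed interpretation)."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by the world size")
    return global_batch_size // world_size


def num_batches(num_samples, batch_size, drop_last):
    """Count the batches one device sees for num_samples samples."""
    if drop_last:
        # The incomplete final batch is discarded.
        return num_samples // batch_size
    # The incomplete final batch is kept (and padded, per the drop_last description).
    return math.ceil(num_samples / batch_size)


batch_size = per_device_batch_size(global_batch_size=64, world_size=8)
batches_dropping = num_batches(1001, batch_size, drop_last=True)
batches_keeping = num_batches(1001, batch_size, drop_last=False)
print(batch_size, batches_dropping, batches_keeping)
```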


Validate that the hparams are of the correct types. Recurses through sub-hparams.


Raises: TypeError – If any field has an incorrect type.
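
A recursive type check of this kind might look like the sketch below. It is not yahp's implementation. It uses plain dataclasses with simple field types, and the names validate_types, TokenizerOptions, and DatasetOptions are hypothetical stand-ins for an hparams class and its sub-hparams.

```python
from dataclasses import dataclass, field, fields, is_dataclass


def validate_types(obj):
    """Raise TypeError if any field of a dataclass instance has the wrong type."""
    for f in fields(obj):
        value = getattr(obj, f.name)
        # bool is a subclass of int, so reject True/False for int fields.
        bool_as_int = f.type is int and isinstance(value, bool)
        if bool_as_int or not isinstance(value, f.type):
            raise TypeError(
                f"{type(obj).__name__}.{f.name}: expected {f.type.__name__}, "
                f"got {type(value).__name__}"
            )
        # Recurse into nested dataclass instances (the "sub-hparams").
        if is_dataclass(value):
            validate_types(value)


@dataclass
class TokenizerOptions:
    name: str = "bert-base-uncased"  # example value only
    max_seq_length: int = 256


@dataclass
class DatasetOptions:
    task: str = "RTE"
    shuffle: bool = True
    tokenizer: TokenizerOptions = field(default_factory=TokenizerOptions)


validate_types(DatasetOptions())  # passes silently

try:
    validate_types(DatasetOptions(tokenizer=TokenizerOptions(max_seq_length="256")))
except TypeError as err:
    print(f"rejected: {err}")
```

The sketch handles only concrete classes as field types; optional or generic annotations would need extra handling.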