GLUE (General Language Understanding Evaluation) dataset hyperparameters (Wang et al., 2019).

The GLUE benchmark datasets consist of nine sentence- or sentence-pair language understanding tasks designed to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.

Note that the GLUE diagnostic dataset, which is designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, is not included here.

Please refer to the GLUE benchmark for more details.


These classes are used with yahp for YAML-based configuration.



class composer.datasets.glue_hparams.GLUEHparams(use_synthetic=False, synthetic_num_unique_samples=100, synthetic_device='cpu', synthetic_memory_format=MemoryFormat.CONTIGUOUS_FORMAT, drop_last=True, shuffle=True, task=None, tokenizer_name=None, split=None, max_seq_length=256, max_network_retries=10)

Bases: composer.datasets.dataset_hparams.DatasetHparams, composer.datasets.synthetic_hparams.SyntheticHparamsMixin

Sets up a generic GLUE dataset loader.

  • task (str) – The GLUE task to train on. Choose one of: 'CoLA', 'MNLI', 'MRPC', 'QNLI', 'QQP', 'RTE', 'SST-2', or 'STS-B'.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. See HuggingFace documentation.

  • split (str) – Whether to use the 'train', 'validation', or 'test' split.

  • max_seq_length (int, optional) – Custom sequence length for the training dataset. Default: 256.

  • max_network_retries (int, optional) – Number of times to retry HTTP requests if they fail. Default: 10.


DataLoader – A PyTorch DataLoader object.
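
The parameter constraints above can be illustrated with a small standalone sketch. This is not Composer's implementation: `GlueConfigSketch`, `validate`, and `text_columns` are hypothetical names introduced here, and the example tokenizer name `bert-base-uncased` is only an illustrative choice. The task and split values are taken from the parameter list, and the column names are those used by the HuggingFace `glue` dataset; that the loader reads exactly these columns is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Tasks and splits exactly as listed in the parameter descriptions.
GLUE_TASKS = ("CoLA", "MNLI", "MRPC", "QNLI", "QQP", "RTE", "SST-2", "STS-B")
GLUE_SPLITS = ("train", "validation", "test")

# Text columns per task in the HuggingFace `glue` dataset.
# None in the second slot marks a single-sentence task.
# Assumption: a GLUE loader would tokenize these columns.
TASK_TEXT_COLUMNS = {
    "CoLA": ("sentence", None),
    "MNLI": ("premise", "hypothesis"),
    "MRPC": ("sentence1", "sentence2"),
    "QNLI": ("question", "sentence"),
    "QQP": ("question1", "question2"),
    "RTE": ("sentence1", "sentence2"),
    "SST-2": ("sentence", None),
    "STS-B": ("sentence1", "sentence2"),
}


@dataclass
class GlueConfigSketch:
    """Hypothetical stand-in mirroring a subset of the GLUEHparams fields."""

    task: Optional[str] = None
    tokenizer_name: Optional[str] = None
    split: Optional[str] = None
    max_seq_length: int = 256
    max_network_retries: int = 10

    def validate(self) -> None:
        """Raise ValueError if a field falls outside the documented values."""
        if self.task not in GLUE_TASKS:
            raise ValueError(f"task must be one of {GLUE_TASKS}, got {self.task!r}")
        if self.split not in GLUE_SPLITS:
            raise ValueError(f"split must be one of {GLUE_SPLITS}, got {self.split!r}")
        if not self.tokenizer_name:
            raise ValueError("tokenizer_name must be set")
        if self.max_seq_length <= 0:
            raise ValueError("max_seq_length must be positive")
        if self.max_network_retries < 0:
            raise ValueError("max_network_retries must be non-negative")

    def text_columns(self) -> Tuple[str, Optional[str]]:
        """Return the (first, second) text column names for the chosen task."""
        return TASK_TEXT_COLUMNS[self.task]


# Example: a sentence-pair task on the training split.
cfg = GlueConfigSketch(task="MRPC", tokenizer_name="bert-base-uncased", split="train")
cfg.validate()
print(cfg.text_columns())
```

Checking the task and split up front gives a clear error before any dataset download or tokenization is attempted, which is the main reason to validate these fields early.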