composer.datasets.build_synthetic_lm_dataloader(synthetic_num_unique_samples, tokenizer_name, global_batch_size, *, split='train', shuffle=True, drop_last=True, use_masked_lm=False, num_tokens=0, mlm_probability=0.15, subsample_ratio=1.0, max_seq_length=1024, **dataloader_kwargs)[source]#

Builds a synthetic dataloader for a generic language modeling dataset.

  • synthetic_num_unique_samples (int) – Number of unique synthetic samples to generate.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer used to preprocess the text. See the HuggingFace documentation.

  • global_batch_size (int) – Global batch size.

  • split (str) – The dataset split to use: 'train', 'val', or 'test'. Default: 'train'.

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • drop_last (bool) – Whether to drop the last incomplete batch. Default: True.

  • use_masked_lm (bool) – Whether to encode the dataset with masked language modeling. Default: False.

  • num_tokens (int, optional) – Number of tokens to train on. 0 trains on all tokens in the dataset. Default: 0.

  • mlm_probability (float, optional) – If using masked language modeling, the probability with which tokens are masked. Default: 0.15.

  • subsample_ratio (float, optional) – Proportion of the dataset to use. Default: 1.0.

  • max_seq_length (int, optional) – Maximum sequence length for the dataset. Default: 1024.

  • **dataloader_kwargs (Dict[str, Any]) – Additional settings for the dataloader (e.g. num_workers).
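To make the parameters above concrete, here is a minimal, dependency-free sketch of what a synthetic language-modeling dataloader conceptually produces: a fixed pool of `synthetic_num_unique_samples` random token sequences of length `max_seq_length`, cycled through in batches, with tokens optionally masked at rate `mlm_probability` when masked language modeling is enabled. The function name, `mask_token_id`, and `vocab_size` here are illustrative assumptions, not part of the Composer API, and real batches would be tokenizer-encoded tensors rather than plain lists.

```python
import random

def build_synthetic_batches(num_unique_samples, vocab_size, batch_size,
                            max_seq_length, num_batches, use_masked_lm=False,
                            mlm_probability=0.15, mask_token_id=0, seed=42):
    """Illustrative stand-in (not the Composer implementation): generate
    a fixed pool of random token sequences and cycle through them in
    fixed-size batches, optionally applying MLM-style masking."""
    rng = random.Random(seed)
    # Pool of unique synthetic samples; token ids 1..vocab_size-1,
    # reserving `mask_token_id` (assumed 0 here) for masking.
    samples = [[rng.randrange(1, vocab_size) for _ in range(max_seq_length)]
               for _ in range(num_unique_samples)]
    batches = []
    for b in range(num_batches):
        # Cycle through the sample pool so every batch is full-sized.
        batch = [samples[(b * batch_size + i) % num_unique_samples]
                 for i in range(batch_size)]
        if use_masked_lm:
            # Replace each token with the mask token with prob. mlm_probability.
            batch = [[mask_token_id if rng.random() < mlm_probability else tok
                      for tok in seq] for seq in batch]
        batches.append(batch)
    return batches

batches = build_synthetic_batches(num_unique_samples=8, vocab_size=100,
                                  batch_size=4, max_seq_length=16,
                                  num_batches=2, use_masked_lm=True)
print(len(batches), len(batches[0]), len(batches[0][0]))  # 2 4 16
```

The real builder additionally wires the pool through the named tokenizer and returns a standard dataloader, so `**dataloader_kwargs` (e.g. `num_workers`) pass straight through to it.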