composer.datasets.c4_hparams#

C4 (Colossal Cleaned Common Crawl) dataset hyperparameters.

Hparams

These classes are used with yahp for YAML-based configuration.

C4DatasetHparams

Builds a DataSpec for the C4 (Colossal Cleaned Common Crawl) dataset.

StreamingC4Hparams

Builds a DataSpec for the StreamingC4 (Colossal Cleaned Common Crawl) dataset.

class composer.datasets.c4_hparams.C4DatasetHparams(drop_last=True, shuffle=True, split=None, num_samples=None, tokenizer_name=None, max_seq_len=None, group_method=None, mlm=False, mlm_probability=0.15, shuffle_buffer_size=10000, seed=5)[source]#

Bases: composer.datasets.dataset_hparams.DatasetHparams

Builds a DataSpec for the C4 (Colossal Cleaned Common Crawl) dataset.

Parameters
  • split (str) – What split of the dataset to use. Either 'train' or 'validation'. Default: None.

  • num_samples (int) – The number of post-processed token samples, used to set the epoch size of the torch.utils.data.IterableDataset. Default: None.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. Default: None.

  • max_seq_len (int) – The max sequence length of each token sample. Default: None.

  • group_method (str) – How to group text samples into token samples. Either 'truncate' or 'concat'. Default: None.

  • mlm (bool) – Whether or not to use masked language modeling. Default: False.

  • mlm_probability (float) – If mlm==True, the probability that tokens are masked. Default: 0.15.

  • shuffle (bool) – Whether to shuffle the samples in the dataset. Currently, shards are assigned and consumed in a deterministic per-device order; shuffling only affects the order of samples within each device's shuffle buffer. Default: True.

  • shuffle_buffer_size (int) – If shuffle=True, samples are read into a buffer of this size (per device) and drawn from it at random to produce shuffled samples. Default: 10000.

  • seed (int) – If shuffle=True, what seed to use for shuffling operations. Default: 5.

  • drop_last (bool) – Whether to drop the last samples for the last batch. Default: True.
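The per-device shuffle buffer described above can be sketched in plain Python. This is an illustrative model of the mechanism only (fill a fixed-size buffer, emit a random element, replace it with the next incoming sample), not the dataset's actual implementation:

```python
import random


def buffered_shuffle(samples, buffer_size, seed):
    """Yield samples in buffered-shuffle order (illustrative sketch)."""
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(sample)
        else:
            # Emit a random buffered sample and replace it with the new one.
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = sample
    # Drain the remaining buffer in random order.
    rng.shuffle(buffer)
    yield from buffer


# Same seed => same deterministic shuffled order; all samples appear once.
out = list(buffered_shuffle(range(10), buffer_size=4, seed=5))
```

Note that a larger shuffle_buffer_size gives a better approximation of a global shuffle at the cost of memory, which is why it is exposed as a tunable hyperparameter.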

Returns

DataSpec – A DataSpec wrapping a PyTorch DataLoader for the dataset.
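Since these classes are used with yahp for YAML-based configuration, a minimal config fragment might look like the following. This is a sketch only: the top-level train_dataset key and the 'c4' registry name are assumptions, and the exact nesting depends on the surrounding trainer config.

```yaml
# Hypothetical yahp YAML fragment for C4DatasetHparams;
# the 'c4' key and surrounding nesting are assumed.
train_dataset:
  c4:
    split: train
    num_samples: 1000000
    tokenizer_name: gpt2
    max_seq_len: 1024
    group_method: concat
    shuffle: true
    drop_last: true
```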

class composer.datasets.c4_hparams.StreamingC4Hparams(drop_last=True, shuffle=True, remote='s3://mosaicml-internal-dataset-c4/mds/1/', local='/tmp/mds-cache/mds-c4/', split='train', tokenizer_name='bert-base-uncased', max_seq_len=512, group_method='truncate', mlm=False, mlm_probability=0.15)[source]#

Bases: composer.datasets.dataset_hparams.DatasetHparams

Builds a DataSpec for the StreamingC4 (Colossal Cleaned Common Crawl) dataset.

Parameters
  • remote (str) – Remote directory (S3 or local filesystem) where the dataset is stored. Default: 's3://mosaicml-internal-dataset-c4/mds/1/'.

  • local (str) – Local filesystem directory where the dataset is cached during operation. Default: '/tmp/mds-cache/mds-c4/'.

  • split (str) – What split of the dataset to use. Either 'train' or 'val'. Default: 'train'.

  • tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with. Default: 'bert-base-uncased'.

  • max_seq_len (int) – The max sequence length of each token sample. Default: 512.

  • group_method (str) – How to group text samples into token samples. Currently only 'truncate' is supported. Default: 'truncate'.

  • mlm (bool) – Whether or not to use masked language modeling. Default: False.

  • mlm_probability (float) – If mlm==True, the probability that tokens are masked. Default: 0.15.
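A corresponding YAML fragment for the streaming variant might look like this. Again a sketch under assumptions: the train_dataset key and the 'streaming_c4' registry name are not confirmed by this reference, and all values shown are simply the documented defaults.

```yaml
# Hypothetical yahp YAML fragment for StreamingC4Hparams;
# the 'streaming_c4' key and surrounding nesting are assumed.
train_dataset:
  streaming_c4:
    remote: s3://mosaicml-internal-dataset-c4/mds/1/
    local: /tmp/mds-cache/mds-c4/
    split: train
    tokenizer_name: bert-base-uncased
    max_seq_len: 512
    group_method: truncate
    mlm: false
```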