build_streaming_c4_dataloader
- composer.datasets.build_streaming_c4_dataloader(global_batch_size, remote='s3://mosaicml-internal-dataset-c4/mds/2/', local='/tmp/mds-cache/mds-c4/', split='train', shuffle=True, drop_last=True, tokenizer_name='bert-base-uncased', max_seq_len=512, group_method='truncate', mlm=False, mlm_probability=0.15, predownload=100000, keep_zip=None, download_retry=2, download_timeout=60, validate_hash=None, shuffle_seed=None, num_canonical_nodes=None, **dataloader_kwargs)
- Builds a DataSpec for the StreamingC4 (Colossal Cleaned Common Crawl) dataset.
- Parameters
- global_batch_size (int) – Global batch size.
- remote (str) – Remote directory (S3 or local filesystem) where the dataset is stored. Default: 's3://mosaicml-internal-dataset-c4/mds/2/'.
- local (str) – Local filesystem directory where the dataset is cached during operation. Default: '/tmp/mds-cache/mds-c4/'.
- split (str) – Which split of the dataset to use. Either 'train' or 'val'. Default: 'train'.
- shuffle (bool) – Whether to shuffle the dataset. Default: True.
- drop_last (bool) – Whether to drop the last samples. Default: True.
- tokenizer_name (str) – The name of the HuggingFace tokenizer used to preprocess text. Default: 'bert-base-uncased'.
- max_seq_len (int) – The maximum sequence length of each token sample. Default: 512.
- group_method (str) – How to group text samples into token samples. Currently only 'truncate' is supported. Default: 'truncate'.
- mlm (bool) – Whether to use masked language modeling. Default: False.
- mlm_probability (float) – If mlm is True, the probability that tokens are masked. Default: 0.15.
- predownload (int, optional) – Target number of samples, ahead of the current position, whose shards are downloaded while iterating. Defaults to 100_000.
- keep_zip (bool, optional) – Whether to keep or delete the compressed file when decompressing downloaded shards. If set to None, the compressed file is kept if and only if remote is a local path. Defaults to None.
- download_retry (int) – Number of download re-attempts before giving up. Defaults to 2.
- download_timeout (float) – Number of seconds to wait for a shard to download before raising an exception. Defaults to 60.
- validate_hash (str, optional) – Optional hash or checksum algorithm to use to validate shards. Defaults to None.
- shuffle_seed (int, optional) – Seed for shuffling, or None for random seed. Defaults to None.
- num_canonical_nodes (int, optional) – Canonical number of nodes for shuffling with resumption. Defaults to None, which is interpreted as the number of nodes of the initial run.
- **dataloader_kwargs (Dict[str, Any]) – Additional settings for the dataloader (e.g. num_workers, etc.)
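The parameters above might be combined as in the following minimal sketch. The helper names c4_dataloader_args and build_c4 are hypothetical, and the chosen values (num_workers=8, shuffle_seed=17, enabling mlm, disabling shuffle and drop_last for the 'val' split) are illustrative assumptions rather than recommendations from the library. The import path is taken from the signature above and may differ between Composer versions.

```python
from typing import Any, Dict


def c4_dataloader_args(global_batch_size: int, split: str = "train", **overrides: Any) -> Dict[str, Any]:
    """Collect keyword arguments for build_streaming_c4_dataloader (hypothetical helper).

    Only parameter names documented above are used; the values are illustrative.
    """
    if split not in ("train", "val"):
        raise ValueError(f"split must be 'train' or 'val', got {split!r}")
    is_train = split == "train"
    args: Dict[str, Any] = {
        "global_batch_size": global_batch_size,
        "split": split,
        # Assumption: shuffle and drop trailing samples only when training.
        "shuffle": is_train,
        "drop_last": is_train,
        "tokenizer_name": "bert-base-uncased",
        "max_seq_len": 512,
        "group_method": "truncate",  # the only documented grouping method
        "mlm": True,                 # illustrative: masked language modeling on
        "mlm_probability": 0.15,
        "shuffle_seed": 17,          # illustrative fixed seed
        "num_workers": 8,            # forwarded through **dataloader_kwargs
    }
    # Caller-supplied values (e.g. remote=..., local=...) take precedence.
    args.update(overrides)
    return args


def build_c4(global_batch_size: int, split: str = "train", **overrides: Any):
    """Build the DataSpec; needs Composer installed and access to the remote."""
    # Imported lazily so the helper above can be used without Composer.
    from composer.datasets import build_streaming_c4_dataloader

    return build_streaming_c4_dataloader(**c4_dataloader_args(global_batch_size, split, **overrides))
```

Because the default remote points at an internal bucket, a caller would most likely pass its own remote and local values as overrides, for example build_c4(2048, remote=..., local=...), with the actual paths depending on where the dataset is hosted.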