StreamingC4
- class streaming.text.StreamingC4(tokenizer_name, max_seq_len, group_method, local, remote=None, split=None, shuffle=False, predownload=100000, keep_zip=None, download_retry=2, download_timeout=60, validate_hash=None, shuffle_seed=None, num_canonical_nodes=None, batch_size=None)
- Implementation of the C4 (Colossal Clean Crawled Corpus) dataset using StreamingDataset.
- Parameters
- tokenizer_name (str) – The name of the HuggingFace tokenizer to use to tokenize samples.
- max_seq_len (int) – The maximum sequence length of each token sample.
- group_method (str) – How to group text samples into token samples. Currently only 'truncate' is supported.
- local (str) – Local dataset directory where shards are cached by split.
- remote (str, optional) – Download shards from this remote path or directory. If None, this rank and worker's partition of the dataset must all exist locally. Defaults to None.
- split (str, optional) – Which dataset split to use, if any. Defaults to None.
- shuffle (bool) – Whether to iterate over the samples in randomized order. Defaults to False.
- predownload (int, optional) – Target number of samples ahead of the current position whose shards are downloaded while iterating. Defaults to 100_000.
- keep_zip (bool, optional) – Whether to keep or delete the compressed file when decompressing downloaded shards. If set to None, the file is kept if and only if remote is local. Defaults to None.
- download_retry (int) – Number of download re-attempts before giving up. Defaults to 2.
- download_timeout (float) – Number of seconds to wait for a shard to download before raising an exception. Defaults to 60.
- validate_hash (str, optional) – Optional hash or checksum algorithm to use to validate shards. Defaults to None.
- shuffle_seed (int, optional) – Seed for shuffling, or None for a random seed. Defaults to None.
- num_canonical_nodes (int, optional) – Canonical number of nodes for shuffling with resumption. Defaults to None, which is interpreted as the number of nodes of the initial run.
- batch_size (int, optional) – Batch size of its DataLoader, which affects how the dataset is partitioned over the workers. Defaults to None.
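
The parameters above can be put together as in the following minimal sketch. The tokenizer name, the paths, the split name, and the helper names `c4_kwargs`, `make_dataset`, and `make_loader` are illustrative assumptions, not part of the documented API. The imports are deferred into the helpers so that the configuration itself can be inspected without `streaming` or `torch` installed.

```python
# Keyword arguments mirroring the documented signature.
# All concrete values below are placeholders chosen for illustration.
c4_kwargs = {
    "tokenizer_name": "gpt2",       # assumed: any HuggingFace tokenizer name
    "max_seq_len": 2048,            # assumed sequence length
    "group_method": "truncate",     # the only method the docs list as supported
    "local": "/tmp/c4-cache",       # hypothetical local shard cache
    "remote": "s3://my-bucket/c4",  # hypothetical remote shard location
    "split": "train",               # assumed split name
    "shuffle": True,
    "shuffle_seed": 17,             # fixed seed for a reproducible order
    "batch_size": 8,                # should match the DataLoader batch size
}


def make_dataset(kwargs):
    """Construct the dataset; requires the `streaming` package."""
    from streaming.text import StreamingC4

    return StreamingC4(**kwargs)


def make_loader(kwargs):
    """Wrap the dataset in a PyTorch DataLoader with a matching batch size."""
    from torch.utils.data import DataLoader

    dataset = make_dataset(kwargs)
    return DataLoader(dataset, batch_size=kwargs["batch_size"])
```

Calling `make_loader(c4_kwargs)` would then return a loader to iterate over. Passing the same `batch_size` to both the dataset and the DataLoader follows from the `batch_size` description above, since the value affects how samples are partitioned over the workers.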