StreamingC4
- class streaming.text.StreamingC4(*, remote=None, local=None, split=None, download_retry=2, download_timeout=60, validate_hash=None, keep_zip=False, epoch_size=None, predownload=None, cache_limit=None, partition_algo='orig', num_canonical_nodes=None, batch_size=None, shuffle=False, shuffle_algo='py1s', shuffle_seed=9176, shuffle_block_size=262144, tokenizer_name, max_seq_len, group_method)
Implementation of the C4 (Colossal Cleaned Common Crawl) dataset using StreamingDataset.
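A minimal usage sketch, assuming the streaming package is installed and that a C4 dataset in the expected shard format already exists at the remote path. The bucket path, local cache directory, split name, tokenizer name, and the helpers build_c4_kwargs and make_dataset are illustrative placeholders, not values taken from this page. Every argument is passed by keyword because the signature is keyword-only, and the last three arguments have no defaults.

```python
def build_c4_kwargs(remote, local, split="train", batch_size=8):
    """Hypothetical helper: collect keyword arguments for StreamingC4."""
    return {
        "remote": remote,            # placeholder remote location
        "local": local,              # placeholder local shard cache
        "split": split,              # assumed split name
        "shuffle": True,
        "batch_size": batch_size,    # same per-device value given to the DataLoader
        "tokenizer_name": "gpt2",    # assumed HuggingFace tokenizer name
        "max_seq_len": 512,
        "group_method": "truncate",  # the only documented grouping method
    }


def make_dataset(kwargs):
    """Build the dataset; the import is lazy so the sketch reads without the package."""
    from streaming.text import StreamingC4
    return StreamingC4(**kwargs)


kwargs = build_c4_kwargs("s3://my-bucket/c4", "/tmp/c4-cache")
# dataset = make_dataset(kwargs)  # requires the streaming package and real data
```

Keeping the arguments in a plain dictionary makes it easy to reuse the same settings for a validation split by changing only the split value.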
- Parameters
remote (str, optional) – Remote path or directory to download the dataset from. If None, its data must exist locally. StreamingDataset uses either streams or remote/local. Defaults to None.
local (str, optional) – Local working directory to download shards to. This is where shards are cached while they are being used. Uses a temp directory if not set. StreamingDataset uses either streams or remote/local. Defaults to None.
split (str, optional) – Which dataset split to use, if any. If provided, we stream from/to the split subdirs of remote and local. Defaults to None.
download_retry (int) – Number of download re-attempts before giving up. Defaults to 2.
download_timeout (float) – Number of seconds to wait for a shard to download before raising an exception. Defaults to 60.
validate_hash (str, optional) – Optional hash or checksum algorithm to use to validate shards. Defaults to None.
keep_zip (bool) – Whether to keep or delete the compressed form when decompressing downloaded shards. If False, the compressed form is kept only if remote is local or there is no remote. Defaults to False.
epoch_size (int, optional) – Number of samples to draw per epoch, balanced across all streams. If None, takes its value from the total number of underlying samples. Provide this field if you are weighting streams relatively to target a larger or smaller epoch size. Defaults to None.
predownload (int, optional) – Target number of samples to download per worker in advance of the current sample. Workers will attempt to download ahead by this many samples during, but not before, training. The recommendation is to provide a value greater than the per-device batch size, so that at least one per-device batch of samples is cached locally. If None, its value is derived from the per-device batch size and the number of canonical nodes as max(batch_size, 256 * batch_size // num_canonical_nodes). Defaults to None.
cache_limit (int, optional) – Maximum size in bytes of this StreamingDataset's shard cache. Before downloading a shard, the least recently used resident shard(s) may be evicted (deleted from the local cache) in order to stay under the limit. Set to None to disable shard eviction. Defaults to None.
partition_algo (str) – Which partitioning algorithm to use. Defaults to orig.
num_canonical_nodes (int, optional) – Canonical number of nodes for shuffling with resumption. The sample space is divided evenly according to the number of canonical nodes. The higher the value, the more independent non-overlapping paths the StreamingDataset replicas take through the shards per model replica (increasing data source diversity). Defaults to None, which is interpreted as 64 times the number of nodes of the initial run. Note: for sequential sample ordering, set shuffle to False and num_canonical_nodes to the number of physical nodes of the initial run.
batch_size (int, optional) – Per-device batch size, the same as what is passed to the DataLoader. This affects how the dataset is partitioned over the workers and is necessary for deterministic resumption and optimal performance. Defaults to None.
shuffle (bool) – Whether to iterate over the samples in randomized order. Defaults to False.
shuffle_algo (str) – Which shuffling algorithm to use. Defaults to py1s.
shuffle_seed (int) – Seed for deterministic data shuffling. Defaults to 9176.
shuffle_block_size (int) – Unit of shuffle. Defaults to 1 << 18.
tokenizer_name (str) – The name of the HuggingFace tokenizer to use to tokenize samples.
max_seq_len (int) – The max sequence length of each token sample.
group_method (str) – How to group text samples into token samples. Currently only 'truncate' is supported.
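The default for predownload described above can be worked through with a little arithmetic. The sketch below is a standalone restatement of the documented formula, not the library's own code. The helper name default_predownload and the constant DEFAULT_SHUFFLE_BLOCK_SIZE are hypothetical, and the sketch assumes batch_size and num_canonical_nodes are both already known integers.

```python
def default_predownload(batch_size: int, num_canonical_nodes: int) -> int:
    """Hypothetical helper restating the documented predownload default."""
    # Integer division applies to the product 256 * batch_size,
    # and the result never drops below one per-device batch.
    return max(batch_size, 256 * batch_size // num_canonical_nodes)


# The documented shuffle_block_size default, 1 << 18, written as a plain integer
# matches the 262144 shown in the class signature.
DEFAULT_SHUFFLE_BLOCK_SIZE = 1 << 18

if __name__ == "__main__":
    for nodes in (1, 64, 512):
        print(nodes, default_predownload(32, nodes))
```

With a per-device batch size of 32, the value is 8192 for one canonical node, 128 for 64 canonical nodes, and 32 for 512 canonical nodes, where the batch-size floor takes over.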