- class composer.datasets.C4Dataset(split, num_samples, tokenizer_name, max_seq_len, group_method, shuffle=False, shuffle_buffer_size=10000, seed=5)
Builds a streaming, sharded, sized torch.utils.data.IterableDataset for the C4 (Colossal Clean Crawled Corpus) dataset. Used for pretraining autoregressive or masked language models. Text samples are streamed directly from the cloud using HuggingFace's C4 dataset with the streaming backend (see https://huggingface.co/datasets/c4 for more details). The text samples are then shuffled, tokenized, and grouped on-the-fly.
split (str) – Which split of the dataset to use. Either 'train' or 'validation'.
num_samples (int) – The number of post-processed token samples, used to set the epoch size of the dataset.
tokenizer_name (str) – The name of the HuggingFace tokenizer to preprocess text with.
max_seq_len (int) – The max sequence length of each token sample.
group_method (str) – How to group text samples into token samples. Either 'truncate' or 'concat'.
shuffle (bool) – Whether to shuffle the samples in the dataset. Currently, shards are assigned and consumed in a deterministic per-device order, but shuffling affects the order of samples via (per-device) shuffle buffers. Default: False.
shuffle_buffer_size (int) – If shuffle=True, samples are read into a buffer of this size (per-device) and randomly sampled from there to produce shuffled samples. Default: 10000.
seed (int) – If shuffle=True, the seed to use for shuffling operations. Default: 5.
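The buffer-based shuffle described by the shuffle, shuffle_buffer_size, and seed parameters can be sketched in plain Python. This is a minimal, self-contained illustration of the general technique (fill a fixed-size buffer, then repeatedly emit a random element and replace it with the next incoming sample), not the library's exact implementation; the function name and signature here are hypothetical.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar('T')

def buffered_shuffle(samples: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximate streaming shuffle over an iterable of samples.

    Fills a buffer of ``buffer_size`` samples, then for each new sample
    yields a randomly chosen buffered element and stores the new sample
    in its place. Remaining buffered samples are shuffled and flushed
    at the end of the stream.
    """
    rng = random.Random(seed)  # fixed seed -> deterministic shuffle order
    buffer: list[T] = []
    for sample in samples:
        if len(buffer) < buffer_size:
            buffer.append(sample)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = sample
    rng.shuffle(buffer)
    yield from buffer
```

Note that this yields only an approximate shuffle: a sample can move at most roughly buffer_size positions earlier in the output order, which is why a larger shuffle_buffer_size gives better mixing at the cost of memory.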
Returns IterableDataset – A streaming, sharded, sized torch.utils.data.IterableDataset yielding token samples.
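To make the group_method parameter concrete, the following is a hedged sketch of the two grouping strategies, assuming 'truncate' clips each tokenized text independently to max_seq_len and 'concat' concatenates token streams and re-splits them into full max_seq_len chunks. The helper name and exact semantics are illustrative, not the library's verbatim code.

```python
from typing import Iterable, Iterator

def group_tokens(token_samples: Iterable[list[int]],
                 max_seq_len: int,
                 group_method: str) -> Iterator[list[int]]:
    """Group tokenized text samples into token samples of at most max_seq_len."""
    if group_method == 'truncate':
        # Clip each sample independently; shorter samples pass through as-is.
        for tokens in token_samples:
            yield tokens[:max_seq_len]
    elif group_method == 'concat':
        # Concatenate across samples and emit only full-length chunks.
        buffer: list[int] = []
        for tokens in token_samples:
            buffer.extend(tokens)
            while len(buffer) >= max_seq_len:
                yield buffer[:max_seq_len]
                buffer = buffer[max_seq_len:]
    else:
        raise ValueError(f'Unknown group_method: {group_method!r}')
```

Under these assumptions, 'truncate' discards tokens past max_seq_len in long texts, while 'concat' wastes no tokens but lets samples span document boundaries.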