build_lm_dataloader#
- composer.datasets.build_lm_dataloader(datadir, tokenizer_name, global_batch_size, *, split='train', shuffle=True, drop_last=True, use_masked_lm=False, num_tokens=0, mlm_probability=0.15, subsample_ratio=1.0, **dataloader_kwargs)[source]#
Builds a dataloader for a generic language modeling dataset.
- Parameters
datadir (list) – List containing the path string to the HuggingFace Datasets directory.
tokenizer_name (str) – The name of the HuggingFace tokenizer used to preprocess the text. See the HuggingFace documentation.
global_batch_size (int) – Global batch size.
split (str) – The dataset split to use; one of 'train', 'val', or 'test'. Default: 'train'.
shuffle (bool) – Whether to shuffle the dataset. Default: True.
drop_last (bool) – Whether to drop the last incomplete batch. Default: True.
use_masked_lm (bool) – Whether the dataset should be encoded with masked language modeling. Default: False.
num_tokens (int, optional) – Number of tokens to train on. 0 will train on all tokens in the dataset. Default: 0.
mlm_probability (float, optional) – If using masked language modeling, the probability with which tokens are masked. Default: 0.15.
subsample_ratio (float, optional) – Proportion of the dataset to use. Default: 1.0.
**dataloader_kwargs (Dict[str, Any]) – Additional settings for the dataloader (e.g. num_workers).
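When use_masked_lm=True, mlm_probability controls the fraction of tokens replaced with a mask token for masked-language-model training. The following is a simplified, self-contained sketch of that masking step (not Composer's actual collator — production collators such as HuggingFace's DataCollatorForLanguageModeling additionally keep or randomly replace some selected positions; the function name and mask_token_id parameter here are illustrative):

```python
import random


def mask_tokens(token_ids, mask_token_id, mlm_probability=0.15, seed=None):
    """Randomly mask tokens, BERT-style (simplified).

    Each token is independently replaced with ``mask_token_id`` with
    probability ``mlm_probability``. The original ids at masked positions
    become the prediction labels; unmasked positions get -100, the label
    value conventionally ignored by the cross-entropy loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            inputs.append(mask_token_id)  # model sees the mask token
            labels.append(tok)            # and must predict the original
        else:
            inputs.append(tok)            # token passed through unchanged
            labels.append(-100)           # ignored by the MLM loss
    return inputs, labels
```

With mlm_probability=0.15, roughly 15% of positions in each sequence carry a label, which is why the default matches the masking rate used by BERT-style pretraining.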