composer.datasets.synthetic_lm#
Synthetic language modeling datasets used for testing, profiling, and debugging.
Functions

- generate_synthetic_tokenizer – Generates a synthetic tokenizer based on a tokenizer family.
- Creates a synthetic …
Classes
- SyntheticTokenizerParams
- class composer.datasets.synthetic_lm.SyntheticTokenizerParams(tokenizer_model, normalizer, pre_tokenizer, decoder, initial_alphabet, special_tokens, pad_token, trainer_cls, tokenizer_cls)[source]#
Bases:
tuple
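Since the class derives from tuple and carries the fields shown in the signature, it behaves like a named tuple of tokenizer-construction parameters. The following is an illustrative sketch only, not Composer's actual definition: the field names come from the signature above, but the type annotations and the example values are assumptions.

```python
from typing import Any, List, NamedTuple

# Illustrative sketch mirroring the field names in the signature above.
# The actual types and semantics in composer may differ.
class SyntheticTokenizerParams(NamedTuple):
    tokenizer_model: Any         # tokenization model (e.g. BPE- or WordPiece-style)
    normalizer: Any              # text normalization step
    pre_tokenizer: Any           # splits raw text before the model runs
    decoder: Any                 # maps token ids back to text
    initial_alphabet: List[str]  # characters seeded into the vocabulary
    special_tokens: List[str]    # e.g. ["[PAD]", "[UNK]"]
    pad_token: str               # which special token is used for padding
    trainer_cls: Any             # class used to train the tokenizer
    tokenizer_cls: Any           # class used to wrap the trained tokenizer

# Because it is a tuple subclass, instances support indexing and unpacking:
params = SyntheticTokenizerParams(
    tokenizer_model=None, normalizer=None, pre_tokenizer=None, decoder=None,
    initial_alphabet=list("ab"), special_tokens=["[PAD]"], pad_token="[PAD]",
    trainer_cls=None, tokenizer_cls=None,
)
```

Bundling the pieces this way lets a single params object describe a whole tokenizer family (model, normalizer, trainer, and so on) and be passed around as one value.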
- composer.datasets.synthetic_lm.generate_synthetic_tokenizer(tokenizer_family, dataset=None, vocab_size=256)[source]#
Generates a synthetic tokenizer based on a tokenizer family.
- Parameters
tokenizer_family (str) – Which tokenizer family to emulate. One of ['gpt2', 'bert'].
dataset (Optional[datasets.Dataset]) – Optionally, the dataset to train the tokenizer on. If None, a SyntheticHFDataset will be generated. Default: None.
vocab_size (int) – The size of the tokenizer vocabulary. Default: 256.
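To make the flow concrete, here is a toy, self-contained sketch of the same idea: generate synthetic text, then build a small capped vocabulary from it. This is not Composer's implementation (which presumably trains a real gpt2- or bert-style tokenizer); the helper names generate_synthetic_text and build_toy_vocab are hypothetical.

```python
import random
from collections import Counter

def generate_synthetic_text(num_samples: int = 100, seed: int = 17) -> list:
    """Toy stand-in for a synthetic dataset: random lowercase 'words'."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [
        " ".join(
            "".join(rng.choice(alphabet) for _ in range(rng.randint(2, 8)))
            for _ in range(rng.randint(3, 10))
        )
        for _ in range(num_samples)
    ]

def build_toy_vocab(texts, vocab_size=256, special_tokens=("[PAD]", "[UNK]")):
    """Build a frequency-ranked character vocabulary, capped at vocab_size.

    A toy illustration of 'train a tokenizer on synthetic data'; the real
    generate_synthetic_tokenizer produces a proper subword tokenizer.
    """
    counts = Counter(ch for text in texts for ch in text)
    vocab = list(special_tokens) + [ch for ch, _ in counts.most_common()]
    return {tok: idx for idx, tok in enumerate(vocab[:vocab_size])}

texts = generate_synthetic_text()
vocab = build_toy_vocab(texts, vocab_size=256)
```

As in the documented function, passing no dataset means the synthetic text is generated on the fly, and vocab_size caps how many entries the resulting vocabulary holds.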