EnWiki
- class streaming.text.EnWiki(local, remote=None, split=None, shuffle=True, prefetch=100000, keep_zip=None, retry=2, timeout=60, hash=None, batch_size=None)
Implementation of the English Wikipedia 2020-01-01 streaming dataset.
- Parameters
local (str) – Local filesystem directory where dataset is cached during operation.
remote (str, optional) – Remote directory (S3 or local filesystem) where dataset is stored. Defaults to None.
split (str, optional) – The dataset split to use, either 'train' or 'val'. Defaults to None.
shuffle (bool) – Whether to iterate over the samples in randomized order. Defaults to True.
prefetch (int, optional) – Target number of samples remaining to prefetch while iterating. Defaults to 100_000.
keep_zip (bool, optional) – Whether to keep or delete the compressed file when decompressing downloaded shards. If set to None, keep iff remote is local. Defaults to None.
retry (int) – Number of download re-attempts before giving up. Defaults to 2.
timeout (float) – Number of seconds to wait for a shard to download before raising an exception. Defaults to 60.
hash (str, optional) – Hash or checksum algorithm to use to validate shards. Defaults to None.
batch_size (int, optional) – Hint the batch size that will be used on each device's DataLoader. Defaults to None.
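
The following is a minimal sketch of how the parameters above might be put together. It relies only on the constructor signature shown on this page. The helper names `enwiki_kwargs` and `make_enwiki`, the cache path, and the bucket name are hypothetical, and shuffling only the training split is an assumption made for illustration, not library behavior.

```python
from typing import Any, Dict, Optional

# The two split names given in the parameter list above.
VALID_SPLITS = ("train", "val")


def enwiki_kwargs(local: str,
                  remote: Optional[str] = None,
                  split: Optional[str] = None,
                  batch_size: Optional[int] = None) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for EnWiki.

    Parameters not set here (prefetch, keep_zip, retry, timeout, hash)
    keep the defaults listed in the parameter list.
    """
    if split is not None and split not in VALID_SPLITS:
        raise ValueError(
            f"split must be one of {VALID_SPLITS} or None, got {split!r}")
    return {
        "local": local,
        "remote": remote,
        "split": split,
        # Assumption for this sketch: randomize order only when training.
        "shuffle": split == "train",
        "batch_size": batch_size,
    }


def make_enwiki(**kwargs: Any):
    """Construct the dataset; requires the streaming package to be installed."""
    # Imported lazily so enwiki_kwargs can be used without the package.
    from streaming.text import EnWiki
    return EnWiki(**kwargs)


# Placeholder locations, not real paths.
train_kwargs = enwiki_kwargs(local="/tmp/enwiki-cache",
                             remote="s3://my-bucket/enwiki",
                             split="train",
                             batch_size=32)

# Uncomment once the package is installed and the remote exists:
# dataset = make_enwiki(**train_kwargs)
```

Since batch_size is described as a hint about each device's DataLoader, the value passed here would normally be the same per-device batch size given to that DataLoader.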