EnWiki

class streaming.text.EnWiki(local, remote=None, split=None, shuffle=True, prefetch=100000, keep_zip=None, retry=2, timeout=60, hash=None, batch_size=None)

Implementation of the English Wikipedia 2020-01-01 streaming dataset.

Parameters
  • local (str) – Local filesystem directory where the dataset is cached during operation.

  • remote (str, optional) – Remote directory (S3 or local filesystem) where the dataset is stored. Defaults to None.

  • split (str, optional) – The dataset split to use, either 'train' or 'val'. Defaults to None.

  • shuffle (bool) – Whether to iterate over the samples in randomized order. Defaults to True.

  • prefetch (int, optional) – Target number of samples remaining to prefetch while iterating. Defaults to 100_000.

  • keep_zip (bool, optional) – Whether to keep or delete the compressed file when decompressing downloaded shards. If None, keep if and only if remote is local. Defaults to None.

  • retry (int) – Number of download re-attempts before giving up. Defaults to 2.

  • timeout (float) – Number of seconds to wait for a shard to download before raising an exception. Defaults to 60.

  • hash (str, optional) – Hash or checksum algorithm to use to validate shards. Defaults to None.

  • batch_size (int, optional) – Hint of the batch size that will be used on each device's DataLoader. Defaults to None.
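
Example

A minimal usage sketch. The remote bucket path, local cache directory, and batch size below are placeholders, not values prescribed by the library; substitute your own dataset locations.

from torch.utils.data import DataLoader

from streaming.text import EnWiki

# Placeholder paths: point `remote` at wherever the streaming shards live and
# `local` at a scratch directory to use as the download cache.
dataset = EnWiki(local='/tmp/mds-cache/enwiki',
                 remote='s3://my-bucket/enwiki',
                 split='train',
                 batch_size=16)

# Pass the same batch size to the DataLoader so the hint given to the dataset
# matches what is actually iterated on each device.
loader = DataLoader(dataset, batch_size=16)

for batch in loader:
    ...  # training step goes here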