- class streaming.Stream(*, remote=None, local=None, split=None, proportion=None, repeat=None, choose=None, download_retry=None, download_timeout=None, validate_hash=None, keep_zip=None)
A dataset, or sub-dataset if mixing, from which we stream/cache samples.
We initialize a StreamingDataset with one or more Streams. Streams may be resampled to achieve different mixtures of samples.
Stream init takes three kinds of arguments:
At least one of remote and local must exist. If no remote, the data must be local. If no local, we cache to a temp directory.
At most one of proportion, repeat, or choose may exist. If provided one of these, we derive the rest. Note that proportion (relative) and repeat/choose (absolute) are mutually incompatible: you must entirely use one or the other (or neither) for all sub-datasets. If none are provided for all streams and epoch_size is unspecified, then each sample from each stream is seen once per epoch. If none are provided for all streams and epoch_size is specified, then streams are sampled in proportion to their size.
The remaining arguments are optional knobs for controlling downloading behavior and default to None. If None, they take a default value provided to or by the StreamingDataset init.
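The first two argument rules can be sketched in plain Python. This is an illustration only, not the library's code: check_stream_args and check_mixture are hypothetical helpers, and the error messages are made up for the example.

```python
from typing import List, Optional


def check_stream_args(remote: Optional[str] = None,
                      local: Optional[str] = None,
                      proportion: Optional[float] = None,
                      repeat: Optional[float] = None,
                      choose: Optional[int] = None) -> str:
    """Hypothetical helper: validate one stream's arguments.

    Returns the weighting mode the arguments select:
    'relative' (proportion), 'absolute' (repeat or choose), or 'none'.
    """
    # Rule 1: at least one of remote and local must exist.
    if remote is None and local is None:
        raise ValueError('At least one of remote and local must be set')

    # Rule 2: at most one of proportion, repeat, or choose may exist.
    weights = {'proportion': proportion, 'repeat': repeat, 'choose': choose}
    given = [name for name, value in weights.items() if value is not None]
    if len(given) > 1:
        raise ValueError(f'At most one weight may be set, got: {given}')
    for name in given:
        if weights[name] < 0:
            raise ValueError(f'{name} must be non-negative')

    if not given:
        return 'none'
    return 'relative' if given[0] == 'proportion' else 'absolute'


def check_mixture(modes: List[str]) -> None:
    """Hypothetical helper: if any stream is relative, all must be."""
    if 'relative' in modes and any(mode != 'relative' for mode in modes):
        raise ValueError('proportion cannot be mixed with repeat/choose '
                         'or with unweighted streams')
```

Calling check_stream_args once per stream and passing the collected modes to check_mixture mirrors the requirement that relative and absolute weights are never combined across sub-datasets.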
remote (str, optional) – Remote path or directory to download the dataset from. If None, its data must exist locally. Defaults to None.
local (str, optional) – Local working directory to download shards to. This is where shards are cached while they are being used. Uses a temp directory if not set. Defaults to None.
split (str, optional) – Which dataset split to use, if any. If provided, we stream from/to the split subdirs of remote and local. Defaults to None.
proportion (float, optional) – How much to upsample or downsample this sub-dataset, as the proportion of the total combined dataset that consists of this sub-dataset. If using proportions, all sub-datasets provided together to the StreamingDataset init must define their proportions. The total combined number of samples is the StreamingDataset argument epoch_size if provided; otherwise it is kept the same as the total size of the underlying data. If provided, must be non-negative. Defaults to None.
repeat (float, optional) – How much to upsample or downsample this sub-dataset, as a multiplier on the number of samples. If provided, must be non-negative. Defaults to None.
choose (int, optional) – How much to upsample or downsample this sub-dataset, as the exact number of resulting samples. If provided, must be non-negative. Defaults to None.
download_retry (int, optional) – Number of download re-attempts before giving up. Defaults to None.
download_timeout (float, optional) – Number of seconds to wait for a shard to download before raising an exception. Defaults to None.
validate_hash (str, optional) – Optional hash or checksum algorithm to use to validate shards. Defaults to None.
keep_zip (bool, optional) – Whether to keep or delete the compressed form when decompressing downloaded shards. If False, keep it if and only if remote is local or there is no remote. Defaults to None.
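To make the relationship between proportion, repeat, and epoch_size concrete, here is a minimal sketch of how per-stream sample counts could be derived from the rules stated above. derive_choose is a hypothetical helper, not part of the library. Normalizing the proportions and truncating with int() are assumptions of this sketch, and the real implementation may round differently.

```python
from typing import List, Optional


def derive_choose(sizes: List[int],
                  proportions: Optional[List[float]] = None,
                  repeats: Optional[List[float]] = None,
                  epoch_size: Optional[int] = None) -> List[int]:
    """Hypothetical helper: samples to draw from each stream per epoch."""
    if proportions is not None:
        # Relative weighting: split the epoch by normalized proportion.
        # The epoch is epoch_size if given, else the underlying total size.
        total = epoch_size if epoch_size is not None else sum(sizes)
        norm = sum(proportions)
        return [int(total * p / norm) for p in proportions]
    if repeats is not None:
        # Absolute weighting: repeat multiplies each stream's own size.
        return [int(n * r) for n, r in zip(sizes, repeats)]
    if epoch_size is not None:
        # No weights, epoch_size given: sample in proportion to size.
        total_size = sum(sizes)
        return [epoch_size * n // total_size for n in sizes]
    # No weights, no epoch_size: each sample is seen once per epoch.
    return list(sizes)


# Two streams of 100 and 300 samples, mixed half and half.
print(derive_choose([100, 300], proportions=[0.5, 0.5]))
```

With equal proportions, the smaller stream is upsampled and the larger one downsampled so that each contributes half of the combined epoch.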
Apply defaults, setting any unset fields.
We use pairs of (name, _name) in order to make type checking happy.
default (Self) – Stream containing default values for all optional fields.
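The (name, _name) pairing can be illustrated with a small stand-alone class. This is a sketch under assumptions: Knobs is a hypothetical stand-in covering only two of the optional fields, and the default values in the usage lines are illustrative, not the library's.

```python
from typing import Optional


class Knobs:
    """Hypothetical stand-in for a Stream's optional download settings."""

    def __init__(self,
                 download_retry: Optional[int] = None,
                 download_timeout: Optional[float] = None) -> None:
        # _name records exactly what the caller passed, possibly None.
        self._download_retry = download_retry
        self._download_timeout = download_timeout
        # name is assigned only once a concrete value is known, so a
        # type checker can treat it as non-optional.
        if download_retry is not None:
            self.download_retry: int = download_retry
        if download_timeout is not None:
            self.download_timeout: float = download_timeout

    def apply_default(self, default: 'Knobs') -> None:
        """Fill every field the caller left unset from the defaults."""
        if self._download_retry is None:
            self.download_retry = default.download_retry
        if self._download_timeout is None:
            self.download_timeout = default.download_timeout


# Illustrative values only.
defaults = Knobs(download_retry=2, download_timeout=60.0)
stream_knobs = Knobs(download_retry=5)
stream_knobs.apply_default(defaults)
```

After apply_default, the explicitly set download_retry is kept while download_timeout is taken from the defaults object.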
- classmethod apply_weights(streams, samples_per_stream, choose_per_epoch, seed)
Given samples per stream, derive each stream’s proportion/repeat/samples.
Modifies streams to save the derived weights.
int – Number of samples to draw per epoch.
Get the size of the index file in bytes.
int – Size in bytes.
Load this Stream’s index, retrieving its shard readers.
world (World) – Distributed context.
List[Reader] – Shard readers.
Ensure (download, validate, extract, etc.) that we have the given shard.
shard (Reader) – Which shard.
int – Change in cache usage.
- set_up_local(shards, cache_usage_per_shard)
Bring a local directory into a consistent state, getting which shards are present.
shards (List[Reader]) – List of this stream’s shards.
cache_usage_per_shard (NDArray[np.int64]) – Cache usage per shard of this stream.