StreamingDataset helps to make training on large datasets from cloud storage as fast, cheap, and scalable as possible. It’s specially designed for multi-node, distributed training for large models—maximizing correctness guarantees, performance, and ease of use. Now, you can efficiently train anywhere, independent of your training data location. Just stream in the data you need, when you need it.
StreamingDataset is compatible with any data type, including images, text, video, and multimodal data. With support for major cloud storage providers (AWS, OCI, GCS, Azure, and any S3-compatible object store such as Cloudflare R2, CoreWeave, Backblaze B2, etc.) and designed as a drop-in replacement for your PyTorch IterableDataset class, StreamingDataset seamlessly integrates into your existing training workflows.
from torch.utils.data import DataLoader
from streaming import StreamingDataset
dataloader = DataLoader(dataset=StreamingDataset(remote='s3://...'))
🔑 Key Features
True Determinism: Samples are in the same order regardless of the number of GPUs, nodes, or CPU workers. This makes it easier to reproduce and debug training runs and loss spikes. For example, you can load a checkpoint trained on 64 GPUs and debug it reproducibly on 8 GPUs.
Instant Mid-Epoch Resumption: Resume training in seconds, not hours, in the middle of a long training run. Minimizing resumption latency can save thousands of dollars in egress fees and idle GPU compute time compared to existing solutions.
High throughput: Our MDS format cuts extraneous work to the bone, resulting in ultra-low sample latency and higher throughput compared to alternatives for workloads bottlenecked by the dataloader.
Equal Convergence: Model convergence from using StreamingDataset is just as good as using local disk, thanks to our shuffling algorithm. StreamingDataset shuffles across all samples assigned to a node, whereas alternative solutions only shuffle samples in a smaller pool (within a single process).
Random access: Access the data you need when you need it. Even if a sample isn’t downloaded yet, you can access dataset[i] to get sample i.
NumPy-style indexing: Fetch data on the fly by providing NumPy-style indexing to StreamingDataset.
Seamless data mixing: Multiple datasets are streamed, shuffled, and mixed together just in time during training.
Disk usage limits: Dynamically delete least recently used shards in order to keep disk usage under a specified limit.
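To make the disk usage limit concrete, here is a minimal sketch of least-recently-used shard eviction in plain Python. It is an illustration of the general idea only, not Streaming's actual implementation: the `ShardCache` class, its methods, the shard names, and the byte sizes are all hypothetical.

```python
from collections import OrderedDict


class ShardCache:
    """Toy LRU tracker for downloaded shards (hypothetical, for illustration only)."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        # Maps shard name -> size in bytes, ordered from least to most recently used.
        self._shards = OrderedDict()

    def touch(self, shard, size_bytes):
        """Record an access to `shard`; return the names of any shards evicted."""
        if shard in self._shards:
            # Already on disk: just mark it as the most recently used.
            self._shards.move_to_end(shard)
            return []
        # Newly downloaded shard: account for its size.
        self._shards[shard] = size_bytes
        self.used_bytes += size_bytes
        evicted = []
        # Evict least recently used shards until under the limit,
        # but never evict the shard that was just added.
        while self.used_bytes > self.limit_bytes and len(self._shards) > 1:
            name, size = self._shards.popitem(last=False)
            self.used_bytes -= size
            evicted.append(name)
        return evicted

    def resident(self):
        """Return shard names currently on disk, least recently used first."""
        return list(self._shards)


cache = ShardCache(limit_bytes=100)
cache.touch("shard.00000.mds", 40)
cache.touch("shard.00001.mds", 40)
cache.touch("shard.00000.mds", 40)            # re-access: shard 0 is now most recent
evicted = cache.touch("shard.00002.mds", 40)  # 120 bytes > 100, so shard 1 is evicted
print(evicted)           # ['shard.00001.mds']
print(cache.resident())  # ['shard.00000.mds', 'shard.00002.mds']
```

Because shard 0 was re-read before the limit was exceeded, the sketch drops shard 1 instead. A real implementation would also have to delete the files and coordinate across worker processes, which this example leaves out.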
Streaming is part of the broader Machine Learning community, and we welcome any contributions, pull requests, and issues.