Quick Start
Start training your model with Streaming in just a few steps!
- Convert your raw dataset into one of our supported file formats. Here, we convert an image dataset to MDS (Mosaic Data Shard) format.

      import numpy as np
      from PIL import Image
      from uuid import uuid4
      from streaming import MDSWriter

      # Local or remote directory path to store the output compressed files.
      out_root = 'dirname'

      # A dictionary of input fields to an Encoder/Decoder type
      columns = {
          'uuid': 'str',
          'img': 'jpeg',
          'clf': 'int'
      }

      # Compression algorithm name
      compression = 'zstd'

      # Generate random images and classes
      samples = [
          {
              'uuid': str(uuid4()),
              'img': Image.fromarray(np.random.randint(0, 256, (32, 48, 3), np.uint8)),
              'clf': np.random.randint(10),
          }
          for _ in range(1000)
      ]

      # Use `MDSWriter` to iterate through the input data and write to a collection of `.mds` files.
      with MDSWriter(out=out_root, columns=columns, compression=compression) as out:
          for sample in samples:
              out.write(sample)

- Replace the original `torch.utils.data.IterableDataset` with your new `streaming.StreamingDataset`. Point it to the dataset written out above, and specify the `batch_size` to both `StreamingDataset` and the `DataLoader`.

      from torch.utils.data import DataLoader
      from streaming import StreamingDataset

      # Remote directory where dataset is stored, from above
      remote_dir = 's3://path/to/dataset'

      # Local directory where dataset is cached during training
      local_dir = '/local/cache/path'

      dataset = StreamingDataset(local=local_dir, remote=remote_dir, batch_size=1, split=None, shuffle=True)

      # Create PyTorch DataLoader
      dataloader = DataLoader(dataset, batch_size=1)

That's it! For additional details on using Streaming, check out the Main Concepts page and How-to Guides.
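Before pointing a training job at the converted data, it can help to confirm that the conversion step wrote what you expect. The sketch below reads the `index.json` file found in the output directory and reports the shard count, total sample count, and column names. It assumes the index is a JSON object with a `shards` list whose entries carry `samples` and `column_names` fields; that layout is an assumption about the on-disk format, not a documented contract, and `summarize_index` is a hypothetical helper, not part of Streaming.

```python
import json
from pathlib import Path


def summarize_index(dataset_dir):
    """Return (shard count, total samples, column names) for an MDS output directory.

    Assumes `index.json` holds a `shards` list whose entries have
    `samples` and `column_names` keys (an assumption about the layout).
    """
    index_path = Path(dataset_dir) / 'index.json'
    with index_path.open() as f:
        index = json.load(f)

    shards = index.get('shards', [])
    total_samples = sum(shard.get('samples', 0) for shard in shards)
    # Column names are taken from the first shard, if there is one.
    column_names = shards[0].get('column_names', []) if shards else []
    return len(shards), total_samples, column_names


# Example call, using the output directory from the conversion step:
# summarize_index('dirname')
```

If the assumptions hold, the total sample count for the example above should match the number of samples written (1000), and the column names should be the keys of the `columns` dictionary.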
We also have starter code for the following popular datasets, which can be found in the streaming directory:
| Dataset | Task | Read | Write | 
|---|---|---|---|
| LAION-400M | Text and image | | |
| WebVid | Text and video | | |
| C4 | Text | | |
| EnWiki | Text | | |
| Pile | Text | | |
| ADE20K | Image segmentation | | |
| CIFAR10 | Image classification | | |
| COCO | Object detection | | |
| ImageNet | Image classification | | |
To start training on these datasets:
- Convert the raw data into `.mds` format using the corresponding script from the `convert` directory. For example:

      $ python -m streaming.multimodal.convert.webvid --in <CSV file> --out <MDS output directory>

- Import the dataset class to start training the model.

      from streaming.multimodal import StreamingInsideWebVid
      # `local` and `remote` are the cache directory and dataset location, as in the StreamingDataset example above.
      dataset = StreamingInsideWebVid(local=local, remote=remote, batch_size=1, shuffle=True)

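If you would rather drive the conversion from Python than from the shell, the command above can be assembled and run with `subprocess`. This is a minimal sketch: `build_convert_command` and `run_conversion` are hypothetical helpers, and the `--in` and `--out` flags are the ones shown for the WebVid script above; other convert scripts may take different arguments.

```python
import subprocess
import sys


def build_convert_command(module, in_path, out_dir):
    """Build the argument list for `python -m <module> --in <in_path> --out <out_dir>`."""
    # sys.executable keeps the conversion in the same Python environment as the caller.
    return [sys.executable, '-m', module, '--in', str(in_path), '--out', str(out_dir)]


def run_conversion(module, in_path, out_dir):
    """Run the conversion and raise CalledProcessError if the script exits non-zero."""
    subprocess.run(build_convert_command(module, in_path, out_dir), check=True)


# Example call (placeholder paths):
# run_conversion('streaming.multimodal.convert.webvid', 'videos.csv', 'webvid_mds')
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.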
Happy training!