π Quick Start#
Start training your model with Streaming in just a few steps!
Convert your raw dataset into one of our supported file formats. Here, we convert an image dataset to MDS (Mosaic Data Shard) format.
import numpy as np from PIL import Image from shutil import rmtree from uuid import uuid4 from streaming import MDSWriter # Local or remote directory path to store the output compressed files. out_root = 'dirname' # A dictionary of input fields to an Encoder/Decoder type columns = { 'uuid': 'str', 'img': 'jpeg', 'clf': 'int' } # Compression algorithm name compression = 'zstd' # Generate random images and classes samples = [ { 'uuid': str(uuid4()), 'img': Image.fromarray(np.random.randint(0, 256, (32, 48, 3), np.uint8)), 'clf': np.random.randint(10), } for _ in range(1000) ] # Use `MDSWriter` to iterate through the input data and write to a collection of `.mds` files. with MDSWriter(out=out_root, columns=columns, compression=compression) as out: for sample in samples: out.write(sample)
Replace the original
torch.utils.data.IterableDataset
with your newstreaming.StreamingDataset
. Point it to the dataset written out above, and specify thebatch_size
to StreamingDataset and the DataLoader.from torch.utils.data import DataLoader from streaming import StreamingDataset # Remote directory where dataset is stored, from above remote_dir = 's3://path/to/dataset' # Local directory where dataset is cached during training local_dir = '/local/cache/path' dataset = StreamingDataset(local=local_dir, remote=remote_dir, batch_size=1, split=None, shuffle=True) # Create PyTorch DataLoader dataloader = DataLoader(dataset, batch_size=1)
Thatβs it! For additional details on using Streaming, check out the Main Concepts page and How-to Guides.
We also have starter code for the following popular datasets, which can be found in the streaming
directory:
Dataset |
Task |
Read |
Write |
---|---|---|---|
LAION-400M |
Text and image |
||
WebVid |
Text and video |
||
C4 |
Text |
||
EnWiki |
Text |
||
Pile |
Text |
||
ADE20K |
Image segmentation |
||
CIFAR10 |
Image classification |
||
COCO |
Image classification |
||
ImageNet |
Image classification |
To start training on these datasets:
Convert raw data into .mds format using the corresponding script from the
convert
directory.
For example:
$ python -m streaming.multimodal.convert.webvid --in <CSV file> --out <MDS output directory>
Import dataset class to start training the model.
from streaming.multimodal import StreamingInsideWebVid
dataset = StreamingInsideWebVid(local=local, remote=remote, batch_size=1, shuffle=True)
Happy training!