# Dataset Conversion to MDS Format
If you have not read the Dataset Format guide and the Dataset Conversion guide, we highly recommend you do so before you start.
To use StreamingDataset, we must first convert the dataset from its native format to MosaicML's Streaming Dataset format, called Mosaic Dataset Shard (MDS). Once in MDS format, we can access the dataset from the local file system (disk, network-attached storage, etc.) or an object store (GCS, OCI, S3, etc.). From an object store, data can be streamed to train deep learning models, and it all just works.
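For instance, once a dataset is in MDS format, reading it back takes only a few lines. The sketch below assumes a hypothetical bucket path and an already-converted dataset:

```python
# A minimal sketch, assuming an MDS dataset already exists at the
# (hypothetical) remote path below; shards are fetched and cached
# under `local` on demand.
from streaming import StreamingDataset

dataset = StreamingDataset(remote='s3://my-bucket/my-dataset',  # hypothetical
                           local='/tmp/my-dataset')

for sample in dataset:  # each sample is a dict keyed by column name
    print(sample)
    break
```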
## Convert raw data into MDS format
Let's look at the steps needed to convert raw data into MDS format.
1. Get the raw dataset; either download it all locally or create an iterator that downloads it on the fly.
2. Create some form of iterator over the raw dataset that fetches one sample at a time.
3. Convert each raw sample into a dict of `column` fields.
4. Instantiate `MDSWriter` and call its `write` method to write one raw sample at a time.
Check out the user guide section, which contains a simple single-process example of data conversion. For a multiprocess dataset conversion example, check out this tutorial.
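Below is a minimal single-process sketch of these steps; the sample fields, encodings, and output path are illustrative assumptions:

```python
# A minimal sketch of the steps above: each raw sample is a dict of column
# fields, and `columns` maps each column name to an MDS encoding.
from streaming import MDSWriter

samples = [
    {'text': 'hello', 'label': 0},  # stand-ins for your raw samples
    {'text': 'world', 'label': 1},
]
columns = {'text': 'str', 'label': 'int'}

with MDSWriter(out='/tmp/my-mds-dataset', columns=columns,
               compression='zstd') as writer:
    for sample in samples:
        writer.write(sample)  # write one raw sample at a time
```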
We've already created conversion scripts that can be used to convert popular public datasets to MDS format. Please see below for usage instructions.
## Spark DataFrame Conversion Examples
Users can read datasets in any format that Spark supports and convert the resulting Spark DataFrame to a Mosaic Streaming dataset. More specifically, we enable converting a Spark DataFrame into MDS format via the utility function `dataframeToMDS`. This utility function is flexible and supports a callable, allowing modifications to the original data format: the function iterates over the callable's output, processes the modified data, and writes it in MDS format. For instance, it can be used with a tokenizer callable that yields tokens as output.
We recommend starting with the example Jupyter notebook, which demonstrates a complete workflow. It illustrates how to use Spark to read raw data into a Spark DataFrame and then convert it into MDS format via the `dataframeToMDS` function. That tutorial also demonstrates the option of passing a preprocessing tokenization job to the converter, which can be useful if materializing the intermediate DataFrame is time-consuming or requires extra development.
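As a rough sketch of the conversion call (the input path is hypothetical, and the exact import path and keyword arguments may differ across streaming versions; the notebook above is the authoritative reference):

```python
# A sketch, assuming a running SparkSession and Parquet input at a
# hypothetical path; `mds_kwargs` is forwarded to the MDS writer.
from pyspark.sql import SparkSession
from streaming.base.converters import dataframeToMDS

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('/path/to/raw')  # hypothetical input

mds_kwargs = {'out': '/path/to/mds', 'columns': {'text': 'str'}}
dataframeToMDS(df, merge_index=True, mds_kwargs=mds_kwargs)
```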
## NLP Dataset Conversion Examples
### C4: Colossal, Cleaned, Common Crawl dataset
Run the `c4.py` script as shown below. The script downloads the raw dataset with `train` and `val` splits from the HuggingFace hub and converts each split to StreamingDataset MDS format in its own split directory. For more advanced use cases, please see the supported arguments for `c4.py` and modify as necessary.

```bash
python c4.py --out_root <local or remote directory path to save output MDS shard files>
```
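Once converted, the shards can be consumed with a standard PyTorch `DataLoader`. A sketch, assuming `--out_root` pointed at a hypothetical `s3://my-bucket/c4`:

```python
# A sketch of streaming the converted C4 shards; the remote path is
# hypothetical and should match the --out_root used above.
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(remote='s3://my-bucket/c4', local='/tmp/c4',
                           split='train', shuffle=True, batch_size=8)
loader = DataLoader(dataset, batch_size=8)
```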
### Wikipedia
Download English Wikipedia 2020-01-01 from here.
Unzip the file `results_text.zip` as shown below.

```bash
unzip results_text.zip
```
Listing the output should show the following directory structure:
```
├── eval.txt
├── part-00000-of-00500
├── part-00001-of-00500
├── part-00002-of-00500
├── .....
├── part-00498-of-00500
└── part-00499-of-00500
```
Run the `enwiki_text.py` script. The script converts the `train` and `val` dataset splits into their own split directories. For more advanced use cases, please see the supported arguments for `enwiki_text.py` and modify as necessary.

```bash
python enwiki_text.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
```
### Pile
Download the Pile dataset from here.
Listing the output should show the following directory structure:
```
├── SHA256SUMS.txt
├── test.jsonl.zst
├── train
│   ├── 00.jsonl.zst
│   ├── 01.jsonl.zst
│   ├── 02.jsonl.zst
│   ├── 03.jsonl.zst
│   ├── .....
│   ├── 28.jsonl.zst
│   └── 29.jsonl.zst
└── val.jsonl.zst
```
Run the `pile.py` script. The script converts the `train`, `test`, and `val` dataset splits into their own split directories. For more advanced use cases, please see the supported arguments for `pile.py` and modify as necessary.

```bash
python pile.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
```
## Vision Dataset Conversion Examples
### ADE20K
Download the ADE20K dataset from here.
Listing the output should show the following directory structure:
```
├── annotations
│   ├── training
│   └── validation
└── images
    ├── training
    └── validation
```
Run the `ade20k.py` script as shown below. The script converts the `train` and `val` dataset splits into their own directories. For advanced use cases, please see the supported arguments for `ade20k.py` and modify as necessary.

```bash
python ade20k.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
```
### CIFAR10
Run the `cifar10.py` script as shown below. The CIFAR10 dataset will be automatically downloaded if it doesn't exist locally. For advanced use cases, please see the supported arguments for `cifar10.py` and modify as necessary.

```bash
python cifar10.py --in_root <local directory to download the raw CIFAR10 dataset> --out_root <local or remote directory path to save output MDS shard files>
```
### MS-COCO
Download the COCO 2017 dataset from here. Please download both the COCO images and annotations and unzip the files as shown below.
```bash
mkdir coco
wget -c http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget -c http://images.cocodataset.org/zips/train2017.zip
wget -c http://images.cocodataset.org/zips/val2017.zip
unzip annotations_trainval2017.zip
unzip train2017.zip
unzip val2017.zip
rm annotations_trainval2017.zip
rm train2017.zip
rm val2017.zip
```
Listing the output should show the following directory structure:
```
├── annotations
│   ├── instances_train2017.json
│   └── instances_val2017.json
├── train2017
│   ├── 000000391895.jpg
│   └── ...
└── val2017
    ├── 000000000139.jpg
    └── ...
```
Run the `coco.py` script as shown below. The script converts the `train` and `val` dataset splits into their own directories. For advanced use cases, please see the supported arguments for `coco.py` and modify as necessary.

```bash
python coco.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
```
### ImageNet
Download the ImageNet dataset from here. Two files are needed: `ILSVRC2012_img_train.tar` for training and `ILSVRC2012_img_val.tar` for validation. Next, untar both files as shown below.

```bash
mkdir val
mv ILSVRC2012_img_val.tar val/
tar -xvf val/ILSVRC2012_img_val.tar -C val/
rm val/ILSVRC2012_img_val.tar
mkdir train
mv ILSVRC2012_img_train.tar train/
tar -xvf train/ILSVRC2012_img_train.tar -C train/
rm train/ILSVRC2012_img_train.tar
```
Listing the output should show the following directory structure:
```
├── train/
│   ├── n01440764
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   └── ......
│   └── ......
└── val/
    ├── n01440764
    │   ├── ILSVRC2012_val_00000293.JPEG
    │   ├── ILSVRC2012_val_00002138.JPEG
    │   └── ......
    └── ......
```
Run the `imagenet.py` script as shown below. The script converts the `train` and `val` dataset splits into their own directories. For advanced use cases, please see the supported arguments for `imagenet.py` and modify as needed.

```bash
python imagenet.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
```
## Multimodal Dataset Conversion Examples
### LAION-400M
1. Install dependencies
Install the `img2dataset` package.
```bash
# Used for crawling.
pip3 install img2dataset==1.41.0
# Optional performance monitoring.
apt install bwm-ng htop iotop
```
2. Get the streaming code
```bash
git clone https://github.com/mosaicml/streaming/
cd streaming/
```
3. Download metadata from the-eye.eu (parquet format)
```bash
./streaming/multimodal/convert/laion/laion400m/download_meta.sh
```
4. Download data from the web (into Parquet format, converting to MDS format)
The `img2dataset` download script saves samples in Parquet files.

```bash
./streaming/multimodal/convert/laion/laion400m/download_data.sh
```
At the same time, run our conversion and uploading script, which writes the samples out in MDS format (you will want to run both scripts concurrently, or disk usage can get excessive):

```bash
./streaming/multimodal/convert/laion/laion400m/convert_and_upload.sh
```
Optionally, for system monitoring, run the commands below:
- Monitor network I/O: `bwm-ng`
- Monitor CPU usage: `htop`
- Monitor disk I/O: `iotop`
- Monitor disk usage: `df -h`
### WebVid
#### Single MDS dataset conversion
Create an MDS dataset from a CSV file containing video URLs (downloads the videos).
Navigate to the WebVid download section, where you will find 2.5M and 10M dataset splits. Download each CSV split you want to process.
Run the `crawl_webvid.py` script with the minimum required arguments as shown below.

```bash
python crawl_webvid.py --in <CSV filepath> --out_root <Output MDS directory>
```
#### Multiple MDS sub-dataset conversion
Create multiple MDS sub-datasets from a CSV file containing video URLs and a list of substrings to match against (downloads the videos).
Navigate to the WebVid download section, where you will find 2.5M and 10M dataset splits. Download each CSV split you want to process.
Run the `crawl_webvid_subsets.py` script with the minimum required arguments as shown below. The script also supports an optional `filter` argument, which takes a comma-separated list of keywords used to split samples into sub-datasets.

```bash
python crawl_webvid_subsets.py --in <CSV filepath> --out_root <Output MDS directory>
```
#### Split out an MDS dataset column
Iterate over an existing MDS dataset containing videos and create a new MDS dataset that does not embed the video contents. Instead, each sample in the new MDS dataset stores a filepath to its video, and the video files (MP4) themselves are stored separately.
Run the `extract_webvid_videos.py` script with the minimum required arguments as shown below.

```bash
python extract_webvid_videos.py --in <Input mp4-inside MDS dataset directory> --out_mds <Output mp4-outside MDS dataset directory> --out_mp4 <Output mp4 videos directory>
```
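Conceptually, the extraction resembles the sketch below. This is not the actual `extract_webvid_videos.py` implementation, and the column names (`content`, `caption`) are assumptions rather than WebVid's real schema:

```python
# A conceptual sketch only: read a local MDS dataset with embedded MP4 bytes,
# save each video to its own file, and write a new MDS dataset that stores
# a filepath instead of the video contents.
import os

from streaming import LocalDataset, MDSWriter

in_dir = '/path/to/mp4-inside-mds'    # hypothetical paths
out_mds = '/path/to/mp4-outside-mds'
out_mp4 = '/path/to/mp4-files'

os.makedirs(out_mp4, exist_ok=True)
dataset = LocalDataset(local=in_dir)
columns = {'caption': 'str', 'mp4_path': 'str'}  # assumed output schema

with MDSWriter(out=out_mds, columns=columns) as writer:
    for idx in range(len(dataset)):
        sample = dataset[idx]
        path = os.path.join(out_mp4, f'{idx:09d}.mp4')
        with open(path, 'wb') as f:
            f.write(sample['content'])  # assumed raw MP4 bytes column
        writer.write({'caption': sample['caption'], 'mp4_path': path})
```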