Dataset Conversion to MDS Format#

If you have not read the Dataset Format and Dataset Conversion guides, we highly recommend doing so before you start.

To use StreamingDataset, we must first convert the dataset from its native format to MosaicML’s Streaming Dataset format, called Mosaic Dataset Shard (MDS). Once in MDS format, the dataset can be accessed from the local file system (disk, network-attached storage, etc.) or an object store (GCS, OCI, S3, etc.). From an object store, data can be streamed to train deep learning models, and it all just works.

Convert raw data into MDS format#

Let’s look at the steps needed to convert a raw dataset into MDS format.

  1. Get the raw dataset. You can either download it all locally or create an iterator that downloads samples on the fly.

  2. For the raw dataset, you need some form of iterator that fetches one sample at a time.

  3. Convert each raw sample into the column format the writer expects (a dict mapping column names to values).

  4. Instantiate MDSWriter and call the write method to write raw samples one at a time.

Check out the user guide section, which contains a simple example of data conversion using a single process. For a multiprocess dataset conversion example, check out this tutorial.

We’ve already created conversion scripts that can be used to convert popular public datasets to MDS format. Please see below for usage instructions.

Spark Dataframe Conversion Examples#

Users can read datasets in any format that Spark supports and convert the Spark DataFrame to a Mosaic Streaming dataset. More specifically,

  1. We enable converting a Spark DataFrame into MDS format via the utility function dataframeToMDS. This utility function is flexible and supports a callable function, allowing modifications to the original data format. The function iterates over the callable’s output, processes the modified data, and writes it in MDS format. For instance, it can be used with a tokenizer callable that yields tokens as output.

  2. We recommend starting with the example Jupyter notebook, which demonstrates a complete workflow. It illustrates how to use Spark to read raw data into a Spark DataFrame and then convert it into MDS format via the dataframeToMDS function. The tutorial also demonstrates passing a preprocessing tokenization job to the converter, which can be useful if materializing the intermediate dataframe is time-consuming or requires extra development.

NLP Dataset Conversion Examples#

C4: Colossal, Cleaned, Common Crawl dataset#

  1. Run the c4.py script as shown below. The script downloads the raw format with train and val splits from the HuggingFace hub and converts each split to StreamingDataset MDS format in its own split directory. For more advanced use cases, please see the supported arguments for c4.py and modify as necessary.

    python c4.py --out_root <local or remote directory path to save output MDS shard files>
    

Wikipedia#

  1. Download English Wikipedia 2020-01-01 from here.

  2. Unzip the file results_text.zip as shown below.

    unzip results_text.zip
    

    Listing the output should show the following directory structure:

    β”œβ”€β”€ eval.txt
    β”œβ”€β”€ part-00000-of-00500
    β”œβ”€β”€ part-00001-of-00500
    β”œβ”€β”€ part-00002-of-00500
    β”œβ”€β”€ .....
    β”œβ”€β”€ part-00498-of-00500
    └── part-00499-of-00500
    
  3. Run the enwiki_text.py script. The script converts the train and val dataset splits into their own split directories. For more advanced use cases, please see the supported arguments for enwiki_text.py and modify as necessary.

    python enwiki_text.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
    

Pile#

  1. Download the Pile dataset from here.

    Listing the output should show the following directory structure:

    β”œβ”€β”€ SHA256SUMS.txt
    β”œβ”€β”€ test.jsonl.zst
    β”œβ”€β”€ train
    β”‚Β Β  β”œβ”€β”€ 00.jsonl.zst
    β”‚Β Β  β”œβ”€β”€ 01.jsonl.zst
    β”‚Β Β  β”œβ”€β”€ 02.jsonl.zst
    β”‚Β Β  β”œβ”€β”€ 03.jsonl.zst
    β”‚Β Β  β”œβ”€β”€ .....
    β”‚Β Β  β”œβ”€β”€ 28.jsonl.zst
    β”‚Β Β  └── 29.jsonl.zst
    └── val.jsonl.zst
    
  2. Run the pile.py script. The script converts the train, test, and val dataset splits into their own split directories. For more advanced use cases, please see the supported arguments for pile.py and modify as necessary.

    python pile.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
    

Vision Dataset Conversion Examples#

ADE20K#

  1. Download the ADE20K dataset from here.

  2. Listing the output should show the following directory structure:

    β”œβ”€β”€ annotations
    β”‚   β”œβ”€β”€ training
    β”‚   └── validation
    └── images
        β”œβ”€β”€ training
        └── validation
    
  3. Run the ade20k.py script as shown below. The script converts the train and val dataset splits into their own directories. For advanced use cases, please see the supported arguments for ade20k.py and modify as necessary.

    python ade20k.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
    

CIFAR10#

  1. Run the cifar10.py script as shown below. The CIFAR10 dataset will be automatically downloaded if it doesn’t exist locally. For advanced use cases, please see the supported arguments for cifar10.py and modify as necessary.

    python cifar10.py --in_root <local directory to download the raw CIFAR10 dataset> --out_root <local or remote directory path to save output MDS shard files>
    

MS-COCO#

  1. Download the COCO 2017 dataset from here. Please download both the COCO images and annotations and unzip the files as shown below.

    mkdir coco
    wget -c http://images.cocodataset.org/annotations/annotations_trainval2017.zip
    wget -c http://images.cocodataset.org/zips/train2017.zip
    wget -c http://images.cocodataset.org/zips/val2017.zip
    
    unzip annotations_trainval2017.zip
    unzip train2017.zip
    unzip val2017.zip
    
    rm annotations_trainval2017.zip
    rm train2017.zip
    rm val2017.zip
    

    Listing the output should show the following directory structure:

    β”œβ”€β”€ annotations
    β”‚   β”œβ”€β”€ instances_train2017.json
    β”‚   └── instances_val2017.json
    β”œβ”€β”€ train2017
    β”‚   β”œβ”€β”€ 000000391895.jpg
    β”‚   β”œβ”€β”€ ...
    └── val2017
        β”œβ”€β”€ 000000000139.jpg
        β”œβ”€β”€ ...
    
  2. Run the coco.py script as shown below. The script converts the train and val dataset splits into their own directories. For advanced use cases, please see the supported arguments for coco.py and modify as necessary.

    python coco.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
    

ImageNet#

  1. Download the ImageNet dataset from here. Two files are needed: ILSVRC2012_img_train.tar for training and ILSVRC2012_img_val.tar for validation. Next, untar both files as shown below.

    mkdir val
    mv ILSVRC2012_img_val.tar val/
    tar -xvf ILSVRC2012_img_val.tar -C val/
    rm ILSVRC2012_img_val.tar
    
    mkdir train
    mv ILSVRC2012_img_train.tar train/
    tar -xvf ILSVRC2012_img_train.tar -C train/
    rm ILSVRC2012_img_train.tar
    

    Listing the output should show the following directory structure:

    β”œβ”€β”€ train/
    β”‚   β”œβ”€β”€ n01440764
    β”‚   β”‚   β”œβ”€β”€ n01440764_10026.JPEG
    β”‚   β”‚   β”œβ”€β”€ n01440764_10027.JPEG
    β”‚   β”‚   β”œβ”€β”€ ......
    β”‚   β”œβ”€β”€ ......
    └── val/
        β”œβ”€β”€ n01440764
        β”‚   β”œβ”€β”€ ILSVRC2012_val_00000293.JPEG
        β”‚   β”œβ”€β”€ ILSVRC2012_val_00002138.JPEG
        β”‚   β”œβ”€β”€ ......
        β”œβ”€β”€ ......
    
  2. Run the imagenet.py script as shown below. The script converts the train and val dataset splits into their own directories. For advanced use cases, please see the supported arguments for imagenet.py and modify as needed.

    python imagenet.py --in_root <Above directory> --out_root <local or remote directory path to save output MDS shard files>
    

Multimodal Dataset Conversion Examples#

LAION-400M#

1. Install dependencies: the img2dataset package for crawling, plus optional monitoring tools.

# Used for crawling.
pip3 install img2dataset==1.41.0

# Optional performance monitoring.
apt install bwm-ng htop iotop

2. Get the streaming code

git clone https://github.com/mosaicml/streaming/
cd streaming/

3. Download metadata from the-eye.eu (parquet format)

./streaming/multimodal/convert/laion/laion400m/download_meta.sh

4. Download data from the web (into parquet files, then convert to MDS format)

The img2dataset download script saves samples in parquet files.

./streaming/multimodal/convert/laion/laion400m/download_data.sh

At the same time, run our conversion and upload script, which writes MDS format (you will want to run both at the same time, or disk usage can get excessive):

./streaming/multimodal/convert/laion/laion400m/convert_and_upload.sh

Optional: for system monitoring, run the commands below:

  • Monitor network i/o: bwm-ng

  • Monitor CPU usage: htop

  • Monitor disk i/o: iotop

  • Monitor disk usage: df -h

WebVid#

Single MDS dataset conversion#

Create an MDS dataset from a CSV file containing video URLs (downloads the videos).

  1. Navigate to the WebVid download section, where you will find 2.5M and 10M dataset splits. Download each CSV split you want to process.

  2. Run the crawl_webvid.py script with the minimum required arguments as shown below.

    python crawl_webvid.py --in <CSV filepath> --out_root <Output MDS directory>
    

Multiple MDS sub-dataset conversion#

Create multiple MDS sub-datasets from a CSV file containing video URLs and a list of substrings to match against (downloads the videos).

  1. Navigate to the WebVid download section, where you will find 2.5M and 10M dataset splits. Download each CSV split you want to process.

  2. Run the crawl_webvid_subsets.py script with the minimum required arguments as shown below. The script also supports an optional filter argument, which takes a comma-separated list of keywords used to split samples into sub-datasets.

    python crawl_webvid_subsets.py --in <CSV filepath> --out_root <Output MDS directory>
    

Split out an MDS dataset column#

Iterate over an existing MDS dataset containing videos and create a new MDS dataset without the video contents embedded in it. Instead, each sample in the new dataset stores a filepath to its video, and the video files (MP4) are stored separately.

  1. Run the extract_webvid_videos.py script with the minimum required arguments as shown below.

    python extract_webvid_videos.py --in <Input mp4-inside MDS dataset directory> --out_mds <Output mp4-outside MDS dataset directory> --out_mp4 <Output mp4 videos directory>