# Compression

Compression lets us store and download smaller shards while still working with the full dataset. It is most beneficial for text, often shrinking shards to roughly a third of their original size, whereas it is only marginally helpful for other modalities such as images. Compression operates at the shard level. We provide several compression algorithms, but in practice, Zstandard is a safe bet across the entire time-size Pareto frontier. Higher compression levels yield higher compression ratios, at the cost of slower compression.
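As a rough sketch of what choosing an algorithm and level looks like when writing shards (the column schema, sample data, and output path below are hypothetical, and the exact `MDSWriter` keyword arguments should be checked against your installed version; the full spec format is explained after the table):

```python
from streaming import MDSWriter

# Hypothetical text dataset: one "text" column per sample.
columns = {'text': 'str'}
samples = [{'text': f'sample number {i}'} for i in range(1000)]

# 'zstd' uses Zstandard at its default level; a spec like 'zstd:16'
# trades slower writes for smaller shards.
with MDSWriter(out='dirname', columns=columns, compression='zstd') as out:
    for sample in samples:
        out.write(sample)
```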

Table of supported compression algorithms:

| Name      | Code   | Min Level | Default Level | Max Level |
|-----------|--------|-----------|---------------|-----------|
| Brotli    | br     | 0         | 11            | 11        |
| Bzip2     | bz2    | 1         | 9             | 9         |
| Gzip      | gz     | 0         | 9             | 9         |
| Snappy    | snappy | –         | –             | –         |
| Zstandard | zstd   | 1         | 3             | 22        |

The compression algorithm to use, if any, is specified by passing `code` or `code:level` as a string to the Writer. Decompression happens behind the scenes in the Stream (inside StreamingDataset) as shards are downloaded. Control whether to keep the compressed version of shards by setting the `keep_zip` flag, either in a specific Stream's init or for all streams in the StreamingDataset init.
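A minimal sketch of the reading side, assuming the `Stream` and `StreamingDataset` classes from this library (the remote and local paths are placeholders):

```python
from streaming import Stream, StreamingDataset

# Keep the compressed shard files for this one stream; its keep_zip
# setting overrides the dataset-level default below.
stream = Stream(remote='s3://my-bucket/text', local='/tmp/text', keep_zip=True)

# Shards are decompressed transparently as they are downloaded.
dataset = StreamingDataset(streams=[stream], keep_zip=False, batch_size=1)
sample = dataset[0]
```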