# Compression
Compression lets you store and download a smaller dataset while working with the full one. It is most beneficial for text, often shrinking shards to about a third of their original size, and only marginally helpful for already-compact modalities like images. Compression is applied per shard. We provide several compression algorithms, but in practice Zstandard
is a safe bet across the entire time-size Pareto frontier. Higher compression levels yield higher compression ratios, at the cost of slower compression.
Table of supported compression algorithms:
| Name | Code | Min Level | Default Level | Max Level |
|---|---|---|---|---|
| Brotli | br | 0 | 11 | 11 |
| Bzip2 | bz2 | 1 | 9 | 9 |
| Gzip | gz | 0 | 9 | 9 |
| Snappy | snappy | – | – | – |
| Zstandard | zstd | 1 | 3 | 22 |
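To get a feel for the level-versus-speed tradeoff described above, here is a small sketch using Python's standard-library `zlib` (the same DEFLATE algorithm behind the `gz` codec in the table); the repetitive payload is made up for illustration and is not part of the Streaming library itself:

```python
import time
import zlib

# Hypothetical payload: repetitive text compresses well, much like text shards.
payload = b'the quick brown fox jumps over the lazy dog\n' * 10_000

# gz supports levels 0-9 (see the table); higher levels trade speed for ratio.
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f'level={level}  ratio={ratio:.1f}x  time={elapsed * 1000:.2f} ms')
```

On a typical run, level 9 produces a somewhat smaller output than level 1 but takes noticeably longer, which is the same shape of tradeoff you will see with the other codecs.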
The compression algorithm to use, if any, is specified by passing `code`
or `code:level`
as a string to the Writer. Decompression happens behind the scenes in the Stream (inside StreamingDataset) as shards are downloaded. Control whether to keep the compressed version of shards by setting the `keep_zip`
flag in the specific Stream's init, or for all streams in the StreamingDataset init.