Hashing#
Streaming supports a variety of hash and checksum algorithms to verify data integrity.
We optionally hash shards while serializing a streaming dataset, saving the resulting hashes in the index, which is written last. After the dataset is finished being written, we may hash the index file itself, the results of which must be stored elsewhere. Hashing during writing is controlled by the Writer argument hashes: Optional[List[str]] = None
. We generally weakly recommend writing streaming datasets with one cryptographic hash algorithm and one fast hash algorithm for offline dataset validation in the future.
Then, we optionally validate shard hashes upon download while reading a streaming dataset. Hashing during reading is controlled separately by the StreamingDataset argument validate_hash: Optional[str] = None
. We recommend reading streaming datasets for training purposes without validating hashes because of the extra cost in time and computation.
Available cryptographic hash functions:
Hash |
Digest Bytes |
---|---|
blake2b |
64 |
blake2s |
32 |
md5 |
16 |
sha1 |
20 |
sha224 |
28 |
sha256 |
32 |
sha384 |
48 |
sha512 |
64 |
sha3_224 |
28 |
sha3_256 |
32 |
sha3_384 |
48 |
sha3_512 |
64 |
Available non-cryptographic hash functions:
Hash |
Digest Bytes |
---|---|
xxh32 |
4 |
xxh64 |
8 |
xxh128 |
16 |
xxh3_64 |
8 |
xxh3_128 |
16 |