Reader
- class streaming.base.format.Reader(dirname, split, compression, hashes, samples, size_limit)
- Provides random access to the samples of a shard.
- Parameters
- dirname (str) – Local dataset directory.
- split (str, optional) – Which dataset split to use, if any.
- compression (str, optional) – Optional compression or compression:level.
- hashes (List[str]) – Optional list of hash algorithms to apply to shard files.
- samples (int) – Number of samples in this shard.
- size_limit (Union[int, str], optional) – Optional shard size limit, after which a new shard is started. If None, everything is put in one shard. The limit can also be given as a human-readable string, for example "100kb" for 100 kilobytes (100 * 1024 bytes).
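The string forms accepted by compression and size_limit can be illustrated with a minimal sketch. The helpers parse_size and parse_compression below are hypothetical names written for this example, not functions taken from the library; they only assume the 1024-based units and the compression:level convention described above.

```python
import re

# Hypothetical helpers, not the library's own parsers. Units are 1024-based,
# matching the "100kb" = 100 * 1024 example in the parameter description.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size(value):
    """Return a byte count for an int or a human-readable string."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*", value)
    if match is None:
        raise ValueError(f"Unrecognized size: {value!r}")
    number, unit = match.groups()
    unit = unit.lower()
    if unit not in _UNITS:
        raise ValueError(f"Unknown unit: {unit!r}")
    return int(float(number) * _UNITS[unit])


def parse_compression(spec):
    """Split "algo" or "algo:level" into (algo, level); level may be None."""
    if spec is None:
        return None, None
    algo, _, level = spec.partition(":")
    return algo, int(level) if level else None
```

Under these assumptions, parse_size("100kb") yields 102400 and parse_compression("zstd:7") yields ("zstd", 7). The set of units and algorithm names the library really accepts may differ.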
 
 - abstract decode_sample(data)
- Decode a sample dict from bytes.
- Parameters
- data (bytes) – The sample encoded as bytes.
- Returns
- Dict[str, Any] – Sample dict.
 
 - evict()
- Remove all files belonging to this shard.
- Returns
- int – Bytes evicted from cache.
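As a rough illustration of the eviction contract (delete the shard's files, report how many bytes were freed), the sketch below uses only the standard library. The function evict_files and the file names are hypothetical and chosen for this example; how the library tracks which files belong to a shard is not shown here.

```python
import os
import tempfile


def evict_files(dirname, basenames):
    """Delete the given files under dirname and return the bytes freed.

    Hypothetical stand-in for the behavior described for evict(); files
    that are already missing are skipped and contribute zero bytes.
    """
    freed = 0
    for basename in basenames:
        path = os.path.join(dirname, basename)
        try:
            size = os.stat(path).st_size
            os.remove(path)
        except FileNotFoundError:
            continue
        freed += size
    return freed


def demo_evict():
    # Create one 10-byte file, then evict it plus a file that does not exist.
    with tempfile.TemporaryDirectory() as dirname:
        with open(os.path.join(dirname, "shard.00000.bin"), "wb") as f:
            f.write(b"x" * 10)
        return evict_files(dirname, ["shard.00000.bin", "shard.00000.bin.zstd"])
```

Returning the byte count lets a caller keep a running total of cache usage without re-scanning the directory.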
 
 - get_item(idx)
- Get the sample at the index.
- Parameters
- idx (int) – Sample index.
- Returns
- Dict[str, Any] – Sample dict.
 
 - get_max_size()
- Get the full size of this shard.
- “Max” in this case means both the raw (decompressed) and zip (compressed) versions are resident (assuming the shard has a zip form). This is the maximum disk usage the shard can reach. When compression was used, even if keep_zip is False, the zip form must still be resident at the same time as the raw form during shard decompression.
- Returns
- int – Size in bytes.
 
 - get_persistent_size(keep_zip)
- Get the persistent size of this shard.
- “Persistent” in this case means that whether both raw and zip are present depends on keep_zip. If zip files are not kept after decompression, they don't count toward the shard's persistent size on disk.
- Parameters
- keep_zip (bool) – Whether to keep zip files after decompressing.
- Returns
- int – Size in bytes.
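The difference between the max and persistent sizes can be written out as arithmetic. The two functions below are a sketch inferred from the descriptions above, not the library's code; raw_size and zip_size are assumed to be the shard's uncompressed and compressed byte counts, with zip_size set to None when no compression was used.

```python
from typing import Optional


def max_size(raw_size: int, zip_size: Optional[int]) -> int:
    """Peak disk usage: raw and zip forms resident at the same time."""
    return raw_size + (zip_size or 0)


def persistent_size(raw_size: int, zip_size: Optional[int], keep_zip: bool) -> int:
    """Steady-state disk usage once decompression has finished."""
    if zip_size is not None and keep_zip:
        # Zip files are retained, so both forms stay on disk.
        return raw_size + zip_size
    # Either there is no zip form or it is deleted after decompression.
    return raw_size
```

For a shard with 1000 raw bytes and 300 zip bytes, this gives a max size of 1300, a persistent size of 1300 with keep_zip=True, and 1000 with keep_zip=False.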
 
 - get_raw_size()
- Get the raw (uncompressed) size of this shard.
- Returns
- int – Size in bytes.
 
 - abstract get_sample_data(idx)
- Get the raw sample data at the index.
- Parameters
- idx (int) – Sample index.
- Returns
- bytes – Sample data.
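The two abstract methods suggest how a concrete format plugs in: get_sample_data returns bytes and decode_sample turns bytes into a dict, so get_item can plausibly be their composition. The sketch below assumes that composition and does not import the library. SketchReader and InMemoryJSONReader are hypothetical classes that only mirror the method names documented here, with samples held in a list instead of shard files.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchReader(ABC):
    """Hypothetical stand-in mirroring the documented Reader method names."""

    @abstractmethod
    def get_sample_data(self, idx: int) -> bytes:
        """Return the raw bytes of the sample at idx."""

    @abstractmethod
    def decode_sample(self, data: bytes) -> Dict[str, Any]:
        """Turn raw bytes into a sample dict."""

    def get_item(self, idx: int) -> Dict[str, Any]:
        # Assumed composition: fetch the bytes, then decode them.
        return self.decode_sample(self.get_sample_data(idx))


class InMemoryJSONReader(SketchReader):
    """Toy subclass holding JSON-encoded samples in a list, not on disk."""

    def __init__(self, blobs: List[bytes]) -> None:
        self.blobs = blobs

    @property
    def size(self) -> int:
        # Number of samples, analogous to the documented size property.
        return len(self.blobs)

    def get_sample_data(self, idx: int) -> bytes:
        return self.blobs[idx]

    def decode_sample(self, data: bytes) -> Dict[str, Any]:
        return json.loads(data.decode("utf-8"))


reader = InMemoryJSONReader([b'{"x": 1}', b'{"x": 2}'])
```

Splitting byte retrieval from decoding keeps the random-access logic independent of the serialization format, which is presumably why both methods are abstract.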
 
 - get_zip_size()
- Get the zip (compressed) size of this shard, if compression was used.
- Returns
- Optional[int] – Size in bytes, or None if it does not exist.
 
 - set_up_local(listing, safe_keep_zip)
- Bring whatever shard files are present to a consistent state, returning whether the shard is present.
- Parameters
- listing (Set[str]) – The listing of all files under dirname/[split/]. This is listed once and then saved because there could potentially be very many shard files.
- safe_keep_zip (bool) – Whether to keep zip files when decompressing. Possible when compression was used. Necessary when local is the remote or there is no remote.
- Returns
- bool – Whether the shard is present.
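The reason for passing a precomputed listing can be shown with a small sketch: the directory is walked once, and each shard then does cheap set lookups instead of hitting the filesystem. The helpers list_files and is_shard_present and the file names are hypothetical. The sketch covers only the presence check and does not model what safe_keep_zip does to zip files.

```python
import os
import tempfile


def list_files(root):
    """Walk root once and return the set of all file paths beneath it."""
    listing = set()
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            listing.add(os.path.join(dirpath, filename))
    return listing


def is_shard_present(listing, root, basenames):
    """True if every file of the shard appears in the precomputed listing."""
    return all(os.path.join(root, name) in listing for name in basenames)


def demo_listing():
    with tempfile.TemporaryDirectory() as root:
        for name in ("shard.00000.bin", "shard.00001.bin"):
            open(os.path.join(root, name), "wb").close()
        listing = list_files(root)  # one directory walk, reused below
        return (
            is_shard_present(listing, root, ["shard.00000.bin"]),
            is_shard_present(listing, root, ["shard.00002.bin"]),
        )
```

With many shards, this replaces one filesystem call per shard file with a single walk followed by in-memory membership tests.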
 
 - property size
- Get the number of samples in this shard.
- Returns
- int – Sample count.