Reader
- class streaming.base.format.Reader(dirname, split, compression, hashes, samples, size_limit)
Provides random access to the samples of a shard.
- Parameters
dirname (str) – Local dataset directory.
split (str, optional) – Which dataset split to use, if any.
compression (str, optional) – Optional compression or compression:level.
hashes (List[str]) – Optional list of hash algorithms to apply to shard files.
samples (int) – Number of samples in this shard.
size_limit (Union[int, str], optional) – Optional shard size limit, after which point to start a new shard. If None, puts everything in one shard. Can also be given as a human-readable byte string, for example "100kb" for 100 kilobytes (100*1024 bytes), and so on.
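The human-readable form of size_limit can be illustrated with a small parser. This is a minimal sketch, not the library's own conversion routine: the helper name `parse_size` and the unit table are assumptions, chosen so that "100kb" yields 100*1024 bytes as described above.

```python
import re
from typing import Optional, Union

# Assumed unit table: powers of 1024, consistent with "100kb" == 100 * 1024.
_UNITS = {"": 1, "b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size(value: Optional[Union[int, str]]) -> Optional[int]:
    """Hypothetical helper: turn an int or a string like "100kb" into bytes."""
    if value is None:
        return None  # No limit: everything goes in one shard.
    if isinstance(value, int):
        return value  # Already a byte count.
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*)\s*", value)
    if match is None:
        raise ValueError(f"Unrecognized size: {value!r}")
    number, unit = match.groups()
    unit = unit.lower()
    if unit not in _UNITS:
        raise ValueError(f"Unrecognized unit: {unit!r}")
    return int(float(number) * _UNITS[unit])


limit = parse_size("100kb")
```

Under these assumptions, `parse_size("100kb")` and `parse_size(102400)` describe the same limit.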
- abstract decode_sample(data)
Decode a sample dict from bytes.
- Parameters
data (bytes) – The sample encoded as bytes.
- Returns
Dict[str, Any] – Sample dict.
- evict()
Remove all files belonging to this shard.
- Returns
int – Bytes evicted from cache.
- get_item(idx)
Get the sample at the index.
- Parameters
idx (int) – Sample index.
- Returns
Dict[str, Any] – Sample dict.
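One way to picture how get_item relates to the two abstract methods is a toy, in-memory reader. This is a sketch under stated assumptions: `ToyReader` is hypothetical and does not subclass the real Reader, JSON is an arbitrary stand-in encoding, and the composition "fetch bytes with get_sample_data, then decode with decode_sample" is assumed rather than stated on this page.

```python
import json
from typing import Any, Dict, List


class ToyReader:
    """Hypothetical in-memory stand-in; it does not subclass the real Reader."""

    def __init__(self, blobs: List[bytes]) -> None:
        self._blobs = blobs

    @property
    def size(self) -> int:
        # Number of samples held by this toy shard.
        return len(self._blobs)

    def get_sample_data(self, idx: int) -> bytes:
        # Raw, still-encoded bytes of one sample.
        return self._blobs[idx]

    def decode_sample(self, data: bytes) -> Dict[str, Any]:
        # JSON is used here only for illustration.
        return json.loads(data.decode("utf-8"))

    def get_item(self, idx: int) -> Dict[str, Any]:
        # Assumed composition: fetch the bytes, then decode them.
        return self.decode_sample(self.get_sample_data(idx))


blobs = [json.dumps({"id": i, "text": f"sample {i}"}).encode("utf-8") for i in range(3)]
reader = ToyReader(blobs)
sample = reader.get_item(1)
```

Splitting the work this way lets a concrete format define only how bytes are located and how they are decoded.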
- get_max_size()
Get the full size of this shard.
"Max" in this case means both the raw (decompressed) and zip (compressed) versions are resident (assuming it has a zip form). This is the maximum disk usage the shard can reach. When compression was used, even if keep_zip is False, the zip form must still be resident at the same time as the raw form during shard decompression.
- Returns
int – Size in bytes.
- get_persistent_size(keep_zip)
Get the persistent size of this shard.
"Persistent" in this case means that whether both the raw and zip forms are present depends on keep_zip. If we are not keeping zip files after decompression, they don't count toward the shard's persistent size on disk.
- Parameters
keep_zip (bool) – Whether to keep zip files after decompressing.
- Returns
int – Size in bytes.
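The relationship between the raw, zip, max, and persistent sizes can be written out as simple arithmetic. The following is a minimal sketch based only on the descriptions on this page: the `ShardSizes` dataclass and its field names are hypothetical, and the byte counts are made-up illustration values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShardSizes:
    """Hypothetical container for one shard's on-disk sizes, in bytes."""

    raw_size: int                   # Decompressed form.
    zip_size: Optional[int] = None  # Compressed form; None if no compression.

    def max_size(self) -> int:
        # Both forms resident at once (the peak during decompression).
        return self.raw_size + (self.zip_size or 0)

    def persistent_size(self, keep_zip: bool) -> int:
        # The zip form only counts if it exists and is kept afterwards.
        if self.zip_size is not None and keep_zip:
            return self.raw_size + self.zip_size
        return self.raw_size


compressed = ShardSizes(raw_size=1000, zip_size=400)
uncompressed = ShardSizes(raw_size=1000)
```

With these illustrative numbers, the compressed shard peaks at 1400 bytes during decompression regardless of keep_zip, but settles at 1000 bytes when the zip file is not kept.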
- get_raw_size()
Get the raw (uncompressed) size of this shard.
- Returns
int – Size in bytes.
- abstract get_sample_data(idx)
Get the raw sample data at the index.
- Parameters
idx (int) – Sample index.
- Returns
bytes – Sample data.
- get_zip_size()
Get the zip (compressed) size of this shard, if compression was used.
- Returns
Optional[int] – Size in bytes, or None if the zip form does not exist.
- set_up_local(listing, safe_keep_zip)
Bring whatever shard files are present to a consistent state, returning whether the shard is present.
- Parameters
listing (Set[str]) – The listing of all files under dirname/[split/]. This is listed once and then saved because there could potentially be very many shard files.
safe_keep_zip (bool) – Whether to keep zip files when decompressing. Possible when compression was used. Necessary when local is the remote or there is no remote.
- Returns
bool – Whether the shard is present.
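The reason for passing a precomputed listing can be sketched as follows: the directory is walked once, and each later presence check is a set lookup rather than a filesystem call. This is only an illustration of that idea, not the logic of set_up_local itself; the helpers `list_files` and `is_present` and the shard file names are hypothetical.

```python
import os
import tempfile
from typing import Set


def list_files(root: str) -> Set[str]:
    """Walk `root` once and collect the full path of every file under it."""
    listing = set()
    for parent, _dirs, files in os.walk(root):
        for name in files:
            listing.add(os.path.join(parent, name))
    return listing


def is_present(listing: Set[str], filename: str) -> bool:
    """Check the saved listing; this does not touch the filesystem."""
    return filename in listing


with tempfile.TemporaryDirectory() as root:
    # Hypothetical shard file names, for illustration only.
    for name in ("shard.00000.bin", "shard.00001.bin"):
        with open(os.path.join(root, name), "wb") as f:
            f.write(b"\0")

    listing = list_files(root)  # Listed once, then reused for every shard.
    present = is_present(listing, os.path.join(root, "shard.00000.bin"))
    missing = is_present(listing, os.path.join(root, "shard.00002.bin"))
```

With many shards, reusing one listing avoids repeating a directory scan or a per-file existence check for each shard.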
- property size
Get the number of samples in this shard.
- Returns
int – Sample count.