XSVWriter
- class streaming.XSVWriter(*, columns, separator, newline='\n', out, keep_local=False, compression=None, hashes=None, size_limit=67108864, **kwargs)
Writes a streaming XSV dataset.
- Parameters
columns (Dict[str, str]) – Sample columns.
separator (str) – String used to separate columns.
newline (str) – Newline character inserted between samples. Defaults to \n.
out (str | Tuple[str, str]) – Output dataset directory to save shard files.
- If out is a local directory, shard files are saved locally.
- If out is a remote directory, a local temporary directory is created to cache the shard files, which are then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded.
- If out is a tuple of (local_dir, remote_dir), shard files are saved in local_dir and also uploaded to the remote location.
keep_local (bool) – If the dataset is uploaded, whether to keep the local dataset directory or remove it after uploading. Defaults to False.
compression (str, optional) – Optional compression or compression:level. Defaults to None.
hashes (List[str], optional) – Optional list of hash algorithms to apply to shard files. Defaults to None.
size_limit (Union[int, str], optional) – Optional shard size limit, after which a new shard is started. If None, puts everything in one shard. Can also be given as a human-readable byte string, for example "100kb" for 100 kilobytes (100*1024), and so on. Defaults to 1 << 26.
**kwargs (Any) – Additional settings for the Writer.
- progress_bar (bool): Display TQDM progress bars for uploading output dataset files to a remote location. Defaults to False.
- max_workers (int): Maximum number of threads used to upload output dataset files in parallel to a remote location. One thread is responsible for uploading one shard file to a remote location. Defaults to min(32, (os.cpu_count() or 1) + 4).
- exist_ok (bool): If the local directory exists and is not empty, whether to overwrite the content or raise an error. False raises an error. True deletes the content and starts fresh. Defaults to False.
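The size_limit parameter accepts either an integer byte count or a human-readable string such as "100kb". The sketch below illustrates one way such a value could be normalized to bytes. It is not the library's own parser: the helper name parse_size_limit and the unit table are hypothetical, and the 1024-based multipliers are an assumption drawn from the "100kb" = 100*1024 example in the parameter description.

```python
import re
from typing import Optional, Union

# Assumed unit table: binary (1024-based) multipliers, chosen to match
# the documented example "100kb" = 100 * 1024 bytes.
_UNITS = {"b": 1, "kb": 1 << 10, "mb": 1 << 20, "gb": 1 << 30}


def parse_size_limit(size_limit: Optional[Union[int, str]]) -> Optional[int]:
    """Hypothetical helper: normalize a size limit to a byte count (or None)."""
    if size_limit is None:
        return None  # no limit: everything goes into a single shard
    if isinstance(size_limit, int):
        return size_limit  # already a byte count
    # Expect an integer followed by a unit suffix, e.g. "100kb".
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", size_limit.lower())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"Unrecognized size limit: {size_limit!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


print(parse_size_limit("100kb"))
print(parse_size_limit(1 << 26))
```

Under these assumptions, "100kb" resolves to 102400 bytes, and the default 1 << 26 is 67108864 bytes, which matches the value shown in the class signature.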
- encode_sample(sample)
Encode a sample dict to bytes.
- Parameters
sample (Dict[str, Any]) – Sample dict.
- Returns
bytes – Sample encoded as bytes.
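To make the dict-to-bytes step concrete, the following is a minimal, standard-library sketch of how a separator-delimited encoder might turn a sample into one row. It is an illustration under stated assumptions, not the library's implementation: the function encode_sample_sketch, its explicit column_names argument, the str() conversion of values, and the UTF-8 encoding are all hypothetical choices made for this example.

```python
from typing import Any, Dict, List


def encode_sample_sketch(sample: Dict[str, Any], column_names: List[str],
                         separator: str, newline: str = "\n") -> bytes:
    """Hypothetical stand-in for an XSV-style sample encoder."""
    values = []
    for name in column_names:
        value = str(sample[name])  # assumed: values are rendered as text
        # A field containing a delimiter would corrupt the row, so reject it.
        if separator in value or newline in value:
            raise ValueError(f"Column {name!r} contains a delimiter: {value!r}")
        values.append(value)
    # One row per sample: fields joined by the separator, ended by the newline.
    return (separator.join(values) + newline).encode("utf-8")


row = encode_sample_sketch({"id": 7, "text": "hello"}, ["id", "text"], separator="\t")
print(row)
```

With a tab separator and the default newline, the sample above becomes the bytes b'7\thello\n'. Rejecting values that contain the separator or newline is a design choice of this sketch: because the format has no quoting or escaping here, a stray delimiter inside a field would make the row ambiguous to read back.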