JSONWriter#
- class streaming.JSONWriter(*, columns, newline='\n', out, keep_local=False, compression=None, hashes=None, size_limit=67108864, **kwargs)[source]#
Writes a streaming JSON dataset.
- Parameters:
newline (str) – Newline character inserted between samples. Defaults to
\\n
.Output dataset directory to save shard files.
If
out
is a local directory, shard files are saved locally.If
out
is a remote directory, a local temporary directory is created to cache the shard files and then the shard files are uploaded to a remote location. At the end, the temp directory is deleted once shards are uploaded.If
out
is a tuple of(local_dir, remote_dir)
, shard files are saved in the local_dir and also uploaded to a remote location.
keep_local (bool) – If the dataset is uploaded, whether to keep the local dataset directory or remove it after uploading. Defaults to
False
.compression (str, optional) – Optional compression or compression:level. Defaults to
None
.hashes (List[str], optional) – Optional list of hash algorithms to apply to shard files. Defaults to
None
.size_limit (Union[int, str], optional) – Optional shard size limit, after which point to start a new shard. If None, puts everything in one shard. Can specify bytes human-readable format as well, for example
"100kb"
for 100 kilobyte (100*1024) and so on. Defaults to1 << 26
.**kwargs (Any) –
Additional settings for the Writer.
- progress_bar (bool): Display TQDM progress bars for uploading output dataset files to
a remote location. Default to
False
.- max_workers (int): Maximum number of threads used to upload output dataset files in
parallel to a remote location. One thread is responsible for uploading one shard file to a remote location. Default to
min(32, (os.cpu_count() or 1) + 4)
.
- encode_sample(sample)[source]#
Encode a sample dict to bytes.
- Parameters:
sample (Dict[str, Any]) – Sample dict.
- Returns:
bytes – Sample encoded as bytes.