MDSWriter#

class streaming.MDSWriter(*, columns, out, keep_local=False, compression=None, hashes=None, size_limit=67108864, **kwargs)[source]#

Writes a streaming MDS dataset.

Parameters:
  • columns (Dict[str, str]) – Sample columns.

  • out (str | Tuple[str, str]) –

    Output dataset directory to save shard files.

    1. If out is a local directory, shard files are saved locally.

    2. If out is a remote directory, a local temporary directory is created to cache the shard files and then the shard files are uploaded to a remote location. At the end, the temp directory is deleted once shards are uploaded.

    3. If out is a tuple of (local_dir, remote_dir), shard files are saved in the local_dir and also uploaded to a remote location.

  • keep_local (bool) – If the dataset is uploaded, whether to keep the local dataset directory or remove it after uploading. Defaults to False.

  • compression (str, optional) – Optional compression or compression:level. Defaults to None.

  • hashes (List[str], optional) – Optional list of hash algorithms to apply to shard files. Defaults to None.

  • size_limit (Union[int, str], optional) – Optional shard size limit, after which point to start a new shard. If None, puts everything in one shard. Can specify bytes human-readable format as well, for example "100kb" for 100 kilobyte (100*1024) and so on. Defaults to 1 << 26.

  • **kwargs (Any) –

    Additional settings for the Writer.

    progress_bar (bool): Display TQDM progress bars for uploading output dataset files to

    a remote location. Default to False.

    max_workers (int): Maximum number of threads used to upload output dataset files in

    parallel to a remote location. One thread is responsible for uploading one shard file to a remote location. Default to min(32, (os.cpu_count() or 1) + 4).

encode_joint_shard()[source]#

Encode a joint shard out of the cached samples (single file).

Returns:

bytes – File data.

encode_sample(sample)[source]#

Encode a sample dict to bytes.

Parameters:

sample (Dict[str, Any]) – Sample dict.

Returns:

bytes – Sample encoded as bytes.

get_config()[source]#

Get object describing shard-writing configuration.

Returns:

Dict[str, Any] – JSON object.