MDSWriter

class streaming.MDSWriter(*, columns, out, keep_local=False, compression=None, hashes=None, size_limit=67108864, **kwargs)

Writes a streaming MDS dataset.

Parameters:
  • columns (Dict[str, str]) – Sample columns.

  • out (str | Tuple[str, str]) –

    Output dataset directory to save shard files.

    1. If out is a local directory, shard files are saved locally.

    2. If out is a remote directory, shard files are first written to a local temporary directory and then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded.

    3. If out is a tuple of (local_dir, remote_dir), shard files are saved in local_dir and also uploaded to remote_dir.

  • keep_local (bool) – If the dataset is uploaded, whether to keep the local dataset directory or remove it after uploading. Defaults to False.

  • compression (str, optional) – Optional compression algorithm, given either by name alone or with a level in the form compression:level. Defaults to None.

  • hashes (List[str], optional) – Optional list of hash algorithms to apply to shard files. Defaults to None.

  • size_limit (Union[int, str], optional) – Optional shard size limit, after which a new shard is started. If None, everything is written to a single shard. The limit can also be given as a human-readable string, for example "100kb" for 100 kilobytes (100*1024 bytes). Defaults to 1 << 26 (67108864 bytes).

  • **kwargs (Any) –

    Additional settings for the Writer.

    progress_bar (bool): Display TQDM progress bars for uploading output dataset files to a remote location. Defaults to False.

    max_workers (int): Maximum number of threads used to upload output dataset files in parallel to a remote location. One thread is responsible for uploading one shard file to a remote location. Defaults to min(32, (os.cpu_count() or 1) + 4).

    exist_ok (bool): If the local directory exists and is not empty, whether to overwrite the content or raise an error. False raises an error. True deletes the content and starts fresh. Defaults to False.
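
The parameters above can be combined as in the following minimal sketch. It rests on some assumptions that this section does not document: that the writer can be used as a context manager, that it exposes a write(sample) method, and that "int" and "str" are valid column encodings. The helper names make_samples and write_dataset are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List

# Column names mapped to encoding names. "int" and "str" are assumed to be
# supported encodings; check the library's documentation for the full list.
COLUMNS: Dict[str, str] = {"id": "int", "text": "str"}


def make_samples(n: int) -> List[Dict[str, Any]]:
    """Build n toy samples whose keys match COLUMNS (hypothetical helper)."""
    return [{"id": i, "text": f"sample {i}"} for i in range(n)]


def write_dataset(out_dir: str, samples: List[Dict[str, Any]]) -> None:
    """Write samples to out_dir as MDS shards (hypothetical helper).

    Requires the streaming package. It is imported lazily so that the
    helpers above stay usable without it.
    """
    from streaming import MDSWriter

    # Assumption: the writer is a context manager with a write() method.
    with MDSWriter(
        columns=COLUMNS,
        out=out_dir,            # a local directory, so shards are saved locally
        size_limit=100 * 1024,  # 102400 bytes, the same limit as the string "100kb"
        exist_ok=True,          # forwarded through **kwargs
    ) as writer:
        for sample in samples:
            writer.write(sample)
```

Calling write_dataset("/tmp/mds_example", make_samples(1000)) would then be expected to produce one or more shard files in that directory. With exist_ok=True, any existing content in the directory is deleted first.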

encode_joint_shard()

Encode a joint shard out of the cached samples (single file).

Returns:

bytes – File data.

encode_sample(sample)

Encode a sample dict to bytes.

Parameters:

sample (Dict[str, Any]) – Sample dict.

Returns:

bytes – Sample encoded as bytes.
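
This section does not specify the byte layout that encode_sample produces. As a conceptual illustration only, the sketch below shows one way a sample dict can be turned into bytes: each column value is serialized according to its declared type, and the pieces are concatenated behind a header of sizes. The ENCODERS table, the function name encode_sample_sketch, and the layout itself are assumptions made for this example, not the library's actual format.

```python
import struct
from typing import Any, Callable, Dict

# Hypothetical per-type encoders; the real library defines its own encodings.
ENCODERS: Dict[str, Callable[[Any], bytes]] = {
    "str": lambda v: v.encode("utf-8"),
    "bytes": lambda v: bytes(v),
    "int": lambda v: struct.pack("<q", v),  # 8-byte little-endian signed integer
}


def encode_sample_sketch(columns: Dict[str, str], sample: Dict[str, Any]) -> bytes:
    """Serialize a sample as a header of uint32 sizes followed by the payloads."""
    parts = []
    for name, type_name in columns.items():
        if name not in sample:
            raise KeyError(f"sample is missing column {name!r}")
        parts.append(ENCODERS[type_name](sample[name]))
    # One little-endian uint32 per column, giving the byte length of its payload.
    header = struct.pack(f"<{len(parts)}I", *[len(p) for p in parts])
    return header + b"".join(parts)


columns = {"id": "int", "text": "str"}
encoded = encode_sample_sketch(columns, {"id": 7, "text": "hello"})
```

Here encoded is 21 bytes long: 8 bytes of header (two uint32 sizes), 8 bytes for the integer, and 5 bytes for the UTF-8 text. Prefixing the sizes lets a reader split the payload back into columns without delimiters.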

get_config()

Get object describing shard-writing configuration.

Returns:

Dict[str, Any] – JSON object.