merge_index
- streaming.base.util.merge_index(*args, **kwargs)
Merge index.json from partitions to form a global index.json.
This can be called as
merge_index(index_file_urls, out, keep_local, download_timeout)
merge_index(out, keep_local, download_timeout)
The first signature takes a list of index file URLs of MDS partitions. The second takes the root of an MDS dataset and parses the partition folders from there.
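To make the merge step concrete, the following is a minimal sketch of a purely local merge, not the library's implementation. The helper name `merge_partition_indexes` is hypothetical, and the index layout it relies on (a top-level `shards` list whose entries hold file names under a `basename` key, plus a `version` field set to 2) is an assumption rather than something stated on this page.

```python
import json
import os


def merge_partition_indexes(index_paths, out_dir):
    """Concatenate the 'shards' lists of several partition index.json files.

    Hypothetical helper. Assumes every partition is a direct sub-directory of
    out_dir and that shard entries store file names under a 'basename' key.
    """
    merged_shards = []
    for path in index_paths:
        # Name of the partition folder that holds this index.json.
        partition = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            index = json.load(f)
        for shard in index["shards"]:
            # Prefix each file name so it resolves relative to out_dir.
            for value in shard.values():
                if isinstance(value, dict) and "basename" in value:
                    value["basename"] = os.path.join(partition, value["basename"])
            merged_shards.append(shard)

    # The 'version' value is an assumption about the index format.
    merged = {"version": 2, "shards": merged_shards}
    with open(os.path.join(out_dir, "index.json"), "w") as f:
        json.dump(merged, f)
    return merged
```

The sketch rewrites file names relative to the dataset root so that a reader of the global index can locate shards inside their partition folders.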
- Parameters
index_file_urls (List[Union[str, Tuple[str,str]]]) –
The index.json files of all the partitions. Each element is either a single path string or a (local, remote) tuple of strings.
If index_file_urls is a list of local URLs, merge locally without downloading.
If index_file_urls is a list of (local, remote) tuples, check whether each local index.json is missing and download it before merging.
If index_file_urls is a list of remote URLs, download all of them and merge.
out (Union[str, Tuple[str,str]]) –
The folder that contains the MDS partitions and where the merged index file is written.
A local directory: the merge happens locally.
A remote directory: download the index.json of every sub-directory, merge locally, and upload the merged index.
A tuple (local_dir, remote_dir): check whether the local index.json files exist and download them if not.
keep_local (bool) – Keep a local copy of the merged index file. Defaults to True.
download_timeout (int) – The allowed time for downloading each json file. Defaults to 60.
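The three cases described for index_file_urls can be sketched as a small planning step that decides, per entry, whether a download is needed. This is an illustration only: `is_remote` and `plan_index_fetch` are hypothetical names, and treating any URL with a non-empty scheme other than `file` as remote is an assumption, not the library's documented rule.

```python
import os
from urllib.parse import urlparse


def is_remote(url):
    """Assumption: a URL scheme other than '' or 'file' means remote storage."""
    return urlparse(url).scheme not in ("", "file")


def plan_index_fetch(index_file_urls):
    """Return one (local, remote, needs_download) triple per entry."""
    plan = []
    for entry in index_file_urls:
        if isinstance(entry, tuple):
            # (local, remote) pair: download only if the local copy is missing.
            local, remote = entry
            plan.append((local, remote, not os.path.exists(local)))
        elif is_remote(entry):
            # Remote-only entry: always download.
            plan.append((None, entry, True))
        else:
            # Local-only entry: merge in place, nothing to download.
            plan.append((entry, None, False))
    return plan
```

Separating the plan from the download itself keeps the decision logic easy to check without any network access.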