get_shuffle_py2s
- streaming.base.shuffle.get_shuffle_py2s(shard_sizes, num_canonical_nodes, seed, epoch, block_size=262144)
Get the shuffled global ordering of samples for an epoch.
The assignment of shards to nodes is fixed across epochs, but each grouping of shards is processed concurrently in a different order by each node's workers each epoch.
- Parameters
shard_sizes (NDArray[np.int64]) – Number of samples contained in each shard, in order.
num_canonical_nodes (int) – Number of canonical nodes.
seed (int) – Base random seed, which is held constant over an entire training run.
epoch (int) – Current epoch, which is added to the seed to get a different deterministic shuffle each epoch.
block_size (int) – Unit of shuffle (ignored, because we shuffle on the basis of shards). Defaults to 1 << 18.
- Returns
NDArray[np.int64] – 1:1 mapping of sample ID to shuffled sample ID.
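
The behavior described above can be illustrated with a minimal sketch. This is not the library's implementation: `sketch_shard_shuffle` is a hypothetical helper, and the way it splits shards evenly across canonical nodes and shuffles samples inside each shard are assumptions made for illustration. It only demonstrates the documented properties: shard-to-node assignment depends on `seed` alone, while the per-node ordering is drawn from `seed + epoch`.

```python
import numpy as np
from numpy.typing import NDArray


def sketch_shard_shuffle(shard_sizes: NDArray[np.int64],
                         num_canonical_nodes: int,
                         seed: int,
                         epoch: int) -> NDArray[np.int64]:
    """Illustrative shard-based shuffle (hypothetical, not the library code)."""
    # Each shard covers a contiguous span [start, stop) of sample IDs.
    stops = np.cumsum(shard_sizes)
    starts = stops - shard_sizes
    spans = list(zip(starts.tolist(), stops.tolist()))

    # The shard order is drawn from the base seed only, so it stays fixed
    # for the whole training run.
    run_rng = np.random.default_rng(seed)
    spans = [spans[i] for i in run_rng.permutation(len(spans))]

    # Assumption: whole shards are split into contiguous groups, one per node.
    groups = np.array_split(np.arange(len(spans)), num_canonical_nodes)

    # Per-epoch randomness: the epoch is added to the seed.
    epoch_rng = np.random.default_rng(seed + epoch)
    pieces = [np.empty(0, dtype=np.int64)]
    for group in groups:
        # Reorder this node's shards for the current epoch.
        for idx in epoch_rng.permutation(group):
            start, stop = spans[idx]
            # Assumption: samples are also shuffled within each shard.
            ids = np.arange(start, stop, dtype=np.int64)
            pieces.append(epoch_rng.permutation(ids))
    return np.concatenate(pieces)


shard_sizes = np.array([4, 3, 5, 2], dtype=np.int64)
epoch0 = sketch_shard_shuffle(shard_sizes, num_canonical_nodes=2, seed=7, epoch=0)
epoch1 = sketch_shard_shuffle(shard_sizes, num_canonical_nodes=2, seed=7, epoch=1)
```

With the same `seed`, calling the sketch twice for one epoch gives identical output, and different epochs give different orderings of the same sample IDs, with each node's group holding the same shards in every epoch.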