- streaming.base.shuffle.get_shuffle_py1e(shard_sizes, num_canonical_nodes, seed, epoch, block_size=262144)
Get the shuffled global ordering of samples for an epoch.
The assignment of shards to nodes is fixed across epochs, but each grouping of shards is processed concurrently in a different order by each node’s workers each epoch.
Parameters:
shard_sizes (NDArray[np.int64]) – Number of samples contained in each shard, in order.
num_canonical_nodes (int) – Number of canonical nodes.
seed (int) – Base random seed, which is held constant over an entire training run.
epoch (int) – Current epoch, which is added to the seed to get a different deterministic shuffle each epoch.
block_size (int) – Unit of shuffle, used to set the std and clip length for the Gaussian noise to be added to each shard. Defaults to 1 << 18.
Returns:
NDArray[np.int64] – 1:1 mapping of sample ID to shuffled sample ID.
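The properties described above (a shard-to-node assignment that is fixed across epochs, a per-epoch seed of seed + epoch, clipped Gaussian noise scaled by block_size, and a 1:1 result) can be illustrated with a small standard-library sketch. This is a toy model and not the library's implementation: the helper names assign_shards and sketch_epoch_shuffle, the round-robin node split, the per-sample noise, and the choice of std = block_size / 4 with a clip of block_size / 2 are all assumptions made for illustration.

```python
import random


def assign_shards(num_shards, num_canonical_nodes, seed):
    """Hypothetical helper: uses only the base seed, so the result is fixed across epochs."""
    shard_ids = list(range(num_shards))
    random.Random(seed).shuffle(shard_ids)
    # Round-robin split is an assumption made for this toy example.
    return [shard_ids[n::num_canonical_nodes] for n in range(num_canonical_nodes)]


def sketch_epoch_shuffle(shard_sizes, num_canonical_nodes, seed, epoch, block_size=1 << 18):
    """Toy per-epoch shuffle (not py1e itself); returns a permutation of all sample IDs."""
    # First global sample ID of each shard.
    starts, total = [], 0
    for size in shard_sizes:
        starts.append(total)
        total += size

    rng = random.Random(seed + epoch)  # the epoch is added to the base seed
    clip = block_size / 2  # assumed clip length
    std = block_size / 4  # assumed standard deviation
    ordering = []
    for shards in assign_shards(len(shard_sizes), num_canonical_nodes, seed):
        shards = list(shards)
        rng.shuffle(shards)  # shard order within a node changes each epoch
        keyed = []
        position = 0
        for shard in shards:
            for offset in range(shard_sizes[shard]):
                # Clipped Gaussian noise lets nearby samples trade places.
                noise = max(-clip, min(clip, rng.gauss(0.0, std)))
                keyed.append((position + noise, starts[shard] + offset))
                position += 1
        keyed.sort()
        ordering.extend(sample_id for _, sample_id in keyed)
    return ordering


sizes = [4, 3, 5, 2]
epoch0 = sketch_epoch_shuffle(sizes, num_canonical_nodes=2, seed=7, epoch=0, block_size=4)
epoch1 = sketch_epoch_shuffle(sizes, num_canonical_nodes=2, seed=7, epoch=1, block_size=4)
print(epoch0)
print(epoch1)
```

Calling the sketch twice with the same seed and epoch gives the same ordering, and each node's set of samples is the same in every epoch even though their order changes. A smaller block_size keeps samples closer to their shard's position in the sequence, while a larger one mixes samples across more neighbouring shards.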