get_shuffle_py1e
- streaming.base.shuffle.get_shuffle_py1e(shard_sizes, num_canonical_nodes, seed, epoch, block_size=262144)
Get the shuffled global ordering of samples for an epoch.
The assignment of shards to nodes is fixed across epochs, but each grouping of shards is processed concurrently in a different order by each node's workers each epoch.
- Parameters
shard_sizes (NDArray[np.int64]) – Number of samples contained in each shard, in order.
num_canonical_nodes (int) – Number of canonical nodes.
seed (int) – Base random seed, which is held constant over an entire training run.
epoch (int) – Current epoch, which is added to the seed to get a different deterministic shuffle each epoch.
block_size (int) – Unit of shuffle, used to set the std and clip length for the Gaussian noise to be added to each shard. Defaults to 1 << 18.
- Returns
NDArray[np.int64] – 1:1 mapping of sample ID to shuffled sample ID.
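
The seeding contract described above (a fixed seed, with epoch added to it) can be illustrated with a minimal, standard-library-only sketch. This is not the library's py1e algorithm: it ignores num_canonical_nodes and block_size and applies no per-shard Gaussian noise. The helper name toy_epoch_shuffle is hypothetical and exists only for illustration. It shows two properties the documentation states: the result is a 1:1 mapping over all sample IDs, and the same (seed, epoch) pair reproduces the same ordering.

```python
import random


def toy_epoch_shuffle(shard_sizes, seed, epoch):
    """Hypothetical stand-in for illustration; NOT the py1e algorithm."""
    # The total number of samples is the sum of the per-shard sample counts.
    total = sum(shard_sizes)
    ordering = list(range(total))
    # Seeding with seed + epoch makes the shuffle deterministic for a given
    # epoch; a different epoch seeds a different shuffle.
    random.Random(seed + epoch).shuffle(ordering)
    return ordering


shard_sizes = [4, 3, 5]  # three shards, 12 samples in total
epoch0 = toy_epoch_shuffle(shard_sizes, seed=42, epoch=0)
epoch0_again = toy_epoch_shuffle(shard_sizes, seed=42, epoch=0)
epoch1 = toy_epoch_shuffle(shard_sizes, seed=42, epoch=1)

# Same seed and epoch reproduce the same ordering.
print(epoch0 == epoch0_again)
# Every sample ID appears exactly once, so the mapping is 1:1.
print(sorted(epoch1) == list(range(sum(shard_sizes))))
```

Going by the signature above, a call to the real function would take the same inputs plus the node count, for example get_shuffle_py1e(np.array([4, 3, 5], dtype=np.int64), num_canonical_nodes=1, seed=42, epoch=0), and would return an NDArray[np.int64] rather than a Python list.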