get_shuffle_py1b

streaming.base.shuffle.get_shuffle_py1b(shard_sizes, num_canonical_nodes, seed, epoch, block_size=262144)

Get the shuffled global ordering of samples for an epoch.

The assignment of shards to nodes is fixed across epochs, but every epoch each node’s workers process that node’s group of shards concurrently in a different order.
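
The idea of a fixed shard-to-node assignment with a per-epoch order can be illustrated with a toy model. This is a minimal sketch using only the standard library, not the library's implementation: the helper name toy_node_shard_orders is hypothetical, it works on whole shards rather than sample spans, and it assumes the grouping depends only on the seed while the order inside each group depends on seed + epoch.

```python
import random


def toy_node_shard_orders(num_shards, num_canonical_nodes, seed, epoch):
    """Toy model: fixed shard-to-node groups, reordered within each group per epoch."""
    shard_ids = list(range(num_shards))
    # Run-level shuffle depends only on the seed, so the grouping is identical every epoch.
    random.Random(seed).shuffle(shard_ids)
    n = num_canonical_nodes
    # Split the shuffled shards into n contiguous groups, one per node.
    groups = [
        shard_ids[i * num_shards // n:(i + 1) * num_shards // n] for i in range(n)
    ]
    # Epoch-level shuffle only reorders shards inside each group.
    epoch_rng = random.Random(seed + epoch)
    for group in groups:
        epoch_rng.shuffle(group)
    return groups


epoch0 = toy_node_shard_orders(8, 2, seed=42, epoch=0)
epoch1 = toy_node_shard_orders(8, 2, seed=42, epoch=1)
# Each node keeps the same set of shards in both epochs; only the order may change.
same_membership = [sorted(g) for g in epoch0] == [sorted(g) for g in epoch1]
```

In this sketch, same_membership is True for any seed, because the epoch only affects the second, within-group shuffle.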

Parameters
  • shard_sizes (NDArray[np.int64]) – Number of samples contained in each shard, in order.

  • num_canonical_nodes (int) – Number of canonical nodes.

  • seed (int) – Base random seed, which is held constant over an entire training run.

  • epoch (int) – Current epoch, which is added to the seed to get a different deterministic shuffle each epoch.

  • block_size (int) – Unit of shuffle. Defaults to 1 << 18 (262144).

Returns

NDArray[np.int64] – 1:1 mapping of sample ID to shuffled sample ID.
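
To make the return value and the role of seed, epoch, and block_size more concrete, here is a simplified sketch under stated assumptions. It is not how get_shuffle_py1b is implemented: the helper toy_block_shuffle is hypothetical, it uses plain Python lists instead of NumPy arrays, and it ignores shards and canonical nodes, shuffling sample IDs only inside consecutive blocks of block_size.

```python
import random


def toy_block_shuffle(num_samples, seed, epoch, block_size):
    """Toy model: shuffle sample IDs inside consecutive blocks of block_size."""
    # The epoch is added to the base seed, giving a different but reproducible shuffle.
    rng = random.Random(seed + epoch)
    ids = list(range(num_samples))
    for start in range(0, num_samples, block_size):
        block = ids[start:start + block_size]
        rng.shuffle(block)
        ids[start:start + block_size] = block
    return ids


mapping = toy_block_shuffle(10, seed=7, epoch=3, block_size=4)
# A 1:1 mapping uses every sample ID exactly once.
is_permutation = sorted(mapping) == list(range(10))
# The same seed and epoch reproduce the same ordering.
is_deterministic = mapping == toy_block_shuffle(10, seed=7, epoch=3, block_size=4)
```

A smaller block_size keeps each sample closer to its original position in this toy model, while a larger one mixes samples over a wider range.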