composer.utils.dist.initialize_dist(device, timeout=300.0)

Initialize the default PyTorch distributed process group.

This function assumes that the following environment variables are set:

  • RANK: The global rank of the process, which should be in the range [0, WORLD_SIZE - 1].

  • LOCAL_RANK: The local rank of the process, which should be in the range [0, LOCAL_WORLD_SIZE - 1].

  • NODE_RANK: The rank of the node.

  • WORLD_SIZE: The total number of processes.

  • LOCAL_WORLD_SIZE: The number of processes on the current node.

  • MASTER_ADDR: The hostname for the rank-zero process.

  • MASTER_PORT: The port for the rank-zero process.

If none of these environment variables is set, this function assumes a single-rank configuration and initializes the default process group using a torch.distributed.HashStore.
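The environment contract above can be checked before calling the function. The following is a minimal sketch using only the standard library; the helper name check_dist_env and its return values are hypothetical and not part of Composer. Raising an error when only some of the variables are set is an assumption of this sketch, since the text above only describes the cases where all or none are set.

```python
import os
from typing import Mapping, Optional

# Variables that initialize_dist is documented to read.
DIST_ENV_VARS = (
    "RANK",
    "LOCAL_RANK",
    "NODE_RANK",
    "WORLD_SIZE",
    "LOCAL_WORLD_SIZE",
    "MASTER_ADDR",
    "MASTER_PORT",
)


def check_dist_env(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: classify the environment before initialization.

    Returns "single-rank" when none of the variables is set and
    "distributed" when all are set with ranks inside the documented ranges.
    """
    env = os.environ if env is None else env

    present = [name for name in DIST_ENV_VARS if name in env]
    if not present:
        # Documented fallback: single-rank configuration.
        return "single-rank"

    missing = [name for name in DIST_ENV_VARS if name not in env]
    if missing:
        # Assumption of this sketch: a partial configuration is an error.
        raise ValueError(f"Missing environment variables: {missing}")

    rank, world_size = int(env["RANK"]), int(env["WORLD_SIZE"])
    local_rank = int(env["LOCAL_RANK"])
    local_world_size = int(env["LOCAL_WORLD_SIZE"])

    if not 0 <= rank <= world_size - 1:
        raise ValueError(f"RANK={rank} is outside [0, {world_size - 1}]")
    if not 0 <= local_rank <= local_world_size - 1:
        raise ValueError(
            f"LOCAL_RANK={local_rank} is outside [0, {local_world_size - 1}]"
        )
    return "distributed"
```

Calling check_dist_env() with no argument inspects os.environ; passing a dictionary makes it easy to test a launcher configuration without modifying the real environment.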

  • device (str | Device) – The device from which the distributed backend is interpreted. Either a string corresponding to a device (one of 'cpu', 'gpu', 'mps', or 'tpu') or a Device.

  • timeout (float, optional) – The timeout for operations executed against the process group, expressed in seconds. (default: 300.0).
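A call with both parameters might look like the sketch below. The wrapper init_for_training and the check_device helper are hypothetical names introduced for illustration. The wrapper handles only the string form of device, and Composer is imported lazily so the module can be loaded on a machine where it is not installed.

```python
# Device strings listed in the parameter description above.
VALID_DEVICES = ("cpu", "gpu", "mps", "tpu")


def check_device(device: str) -> str:
    """Hypothetical helper: reject strings outside the documented set."""
    if device not in VALID_DEVICES:
        raise ValueError(
            f"device must be one of {VALID_DEVICES}, got {device!r}"
        )
    return device


def init_for_training(device: str = "cpu", timeout: float = 300.0) -> None:
    """Hypothetical wrapper around composer.utils.dist.initialize_dist."""
    # Imported here so that importing this module does not require Composer.
    from composer.utils import dist

    # timeout is expressed in seconds, matching the documented default of 300.0.
    dist.initialize_dist(check_device(device), timeout=timeout)
```

For example, init_for_training("gpu", timeout=600.0) would initialize the default process group with a ten-minute timeout, provided Composer is installed and the process was started with the environment variables described above (or with none of them, for the single-rank case).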