initialize_dist
- composer.utils.dist.initialize_dist(device, timeout=300.0)
Initialize the default PyTorch distributed process group.
This function assumes that the following environment variables are set:
RANK: The global rank of the process, which should be in the range [0, WORLD_SIZE - 1].
LOCAL_RANK: The local rank of the process, which should be in the range [0, LOCAL_WORLD_SIZE - 1].
NODE_RANK: The rank of the node.
WORLD_SIZE: The total number of processes.
LOCAL_WORLD_SIZE: The number of processes on the current node.
MASTER_ADDR: The hostname for the rank-zero process.
MASTER_PORT: The port for the rank-zero process.
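The ranges listed above can be checked before initialization. The following is a minimal sketch, not Composer's own code: `read_dist_env` and `INT_VARS` are hypothetical names introduced here for illustration, and the sketch assumes every variable is present and that the rank, size, and port values parse as integers.

```python
import os
from typing import Mapping, Optional

# Integer-valued variables from the list above (MASTER_PORT is handled separately).
INT_VARS = ("RANK", "LOCAL_RANK", "NODE_RANK", "WORLD_SIZE", "LOCAL_WORLD_SIZE")


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: parse the launcher variables and check their ranges."""
    env = os.environ if env is None else env

    # A missing variable raises KeyError; a non-integer value raises ValueError.
    cfg = {name: int(env[name]) for name in INT_VARS}
    cfg["MASTER_ADDR"] = env["MASTER_ADDR"]
    cfg["MASTER_PORT"] = int(env["MASTER_PORT"])

    # RANK must lie in [0, WORLD_SIZE - 1].
    if not 0 <= cfg["RANK"] < cfg["WORLD_SIZE"]:
        raise ValueError(f"RANK={cfg['RANK']} is outside [0, {cfg['WORLD_SIZE'] - 1}]")

    # LOCAL_RANK must lie in [0, LOCAL_WORLD_SIZE - 1].
    if not 0 <= cfg["LOCAL_RANK"] < cfg["LOCAL_WORLD_SIZE"]:
        raise ValueError(
            f"LOCAL_RANK={cfg['LOCAL_RANK']} is outside [0, {cfg['LOCAL_WORLD_SIZE'] - 1}]"
        )
    return cfg


if __name__ == "__main__":
    # Illustrative values for the last process of a 2-node, 2-processes-per-node job.
    example = {
        "RANK": "3", "LOCAL_RANK": "1", "NODE_RANK": "1",
        "WORLD_SIZE": "4", "LOCAL_WORLD_SIZE": "2",
        "MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500",
    }
    print(read_dist_env(example))
```

Passing a plain dictionary instead of reading `os.environ` directly keeps the check easy to exercise without a real launcher.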
If none of the environment variables are set, this function will assume a single-rank configuration and initialize the default process group using a torch.distributed.HashStore store.
- Parameters
device (str | Device) – The device from which the distributed backend is interpreted. Either a string corresponding to a device (one of 'cpu', 'gpu', 'mps', or 'tpu') or a Device.
timeout (float, optional) – The timeout for operations executed against the process group, expressed in seconds. (default: 300.0)
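As a usage illustration, the sketch below calls `initialize_dist` with a device string and the default timeout, and first reports whether the single-rank fallback described above would be expected. `uses_single_rank_fallback`, `launch`, and `DIST_ENV_VARS` are hypothetical names introduced here. The sketch assumes Composer is installed when `launch` is actually called, which is why the import is placed inside that function.

```python
import os
from typing import Mapping, Optional

# The environment variables named in this function's documentation.
DIST_ENV_VARS = (
    "RANK", "LOCAL_RANK", "NODE_RANK", "WORLD_SIZE",
    "LOCAL_WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT",
)


def uses_single_rank_fallback(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical check: True when none of the launcher variables are set.

    The documentation only describes the all-set and none-set cases; what
    happens when some variables are set is not covered here.
    """
    env = os.environ if env is None else env
    return not any(name in env for name in DIST_ENV_VARS)


def launch(device: str = "cpu", timeout: float = 300.0) -> None:
    """Hypothetical wrapper around the documented call."""
    # Imported lazily so this file can be loaded without Composer installed.
    from composer.utils import dist

    # `device` is one of 'cpu', 'gpu', 'mps', or 'tpu'; `timeout` is in seconds.
    dist.initialize_dist(device, timeout=timeout)


if __name__ == "__main__":
    if uses_single_rank_fallback():
        print("No launcher variables set; expecting the single-rank fallback.")
    launch("cpu", timeout=300.0)
```

Run directly with no launcher, this should take the single-rank path; under a multi-process launcher that sets the variables, the same script would join the shared process group instead.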