# composer.utils.dist

Helper methods for torch.distributed.

To use torch.distributed, launch your training script with the composer launcher for distributed training. For example, the following command launches an eight-process training run.

```shell
composer -n 8 path/to/train.py
```


The composer launcher will automatically configure the following environment variables, which are required for distributed training:

• RANK: The global rank of the process, in the range [0, WORLD_SIZE - 1].

• LOCAL_RANK: The local rank of the process, in the range [0, LOCAL_WORLD_SIZE - 1].

• NODE_RANK: The rank of the node.

• WORLD_SIZE: The total number of processes.

• LOCAL_WORLD_SIZE: The number of processes on the current node.

• MASTER_ADDR: The hostname for the rank-zero process.

• MASTER_PORT: The port for the rank-zero process.

If none of these environment variables are set, this module will safely assume a single-rank configuration, where:

```shell
RANK=0
LOCAL_RANK=0
NODE_RANK=0
WORLD_SIZE=1
LOCAL_WORLD_SIZE=1
```


## Functions

| Function | Description |
| --- | --- |
| all_gather | Collects a tensor from each rank. |
| all_gather_object | Collects a pickleable object from each rank and returns a list of these objects indexed by rank. |
| all_reduce | Reduces a tensor by applying the reduce_operation. |
| barrier | Synchronizes all processes. |
| broadcast | Broadcasts the tensor to the whole group. |
| broadcast_object_list | Broadcasts picklable objects in object_list to the whole group. |
| get_global_rank | Returns the global rank of the current process, in the range [0, WORLD_SIZE - 1]. |
| get_local_rank | Returns the local rank of the current process, in the range [0, LOCAL_WORLD_SIZE - 1]. |
| get_local_world_size | Returns the local world size, which is the number of processes on the current node. |
| get_node_rank | Returns the node rank. |
| get_sampler | Constructs a DistributedSampler for a dataset. |
| get_world_size | Returns the world size, which is the number of processes participating in this training run. |
| initialize_dist | Initializes the default PyTorch distributed process group. |
| is_available | Returns whether PyTorch was built with distributed support. |
| is_initialized | Returns whether PyTorch distributed is initialized. |
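Assuming every node runs the same number of processes, the rank variables relate to one another by a simple identity, sketched below. This is an illustrative helper for reasoning about the variables, not a function provided by this module:

```python
def global_rank(node_rank: int, local_rank: int, local_world_size: int) -> int:
    """Global rank of a process, assuming each node runs local_world_size processes.

    Processes on node 0 occupy global ranks [0, local_world_size - 1],
    processes on node 1 occupy [local_world_size, 2 * local_world_size - 1], etc.
    """
    return node_rank * local_world_size + local_rank
```

For instance, in an eight-process run spread across two nodes (four processes per node), the process with LOCAL_RANK=1 on NODE_RANK=1 has global rank `1 * 4 + 1 = 5`.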