composer.utils.dist#

Helper methods for torch.distributed.

To use torch.distributed, launch your training script with the composer launcher for distributed training. For example, the following command launches an eight-process training run.

composer -n 8 path/to/train.py

The composer launcher will automatically configure the following environment variables, which are required for distributed training:

  • RANK: The global rank of the process, which is in the range [0, WORLD_SIZE - 1].

  • LOCAL_RANK: The local rank of the process, which is in the range [0, LOCAL_WORLD_SIZE - 1].

  • NODE_RANK: The rank of the node.

  • WORLD_SIZE: The total number of processes.

  • LOCAL_WORLD_SIZE: The number of processes on the current node.

  • MASTER_ADDR: The hostname for the rank-zero process.

  • MASTER_PORT: The port for the rank-zero process.

If none of these environment variables are set, this module will safely assume a single-rank configuration, where:

RANK=0
LOCAL_RANK=0
NODE_RANK=0
WORLD_SIZE=1
LOCAL_WORLD_SIZE=1
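
You can verify this configuration at runtime with the rank helpers documented below. The following is a minimal sketch that only calls getters listed in this module:

from composer.utils import dist

# Print the distributed configuration as seen by the current process.
# Without the launcher-set environment variables, these calls fall back
# to the single-rank defaults shown above.
print(f"global rank:      {dist.get_global_rank()}")
print(f"local rank:       {dist.get_local_rank()}")
print(f"node rank:        {dist.get_node_rank()}")
print(f"world size:       {dist.get_world_size()}")
print(f"local world size: {dist.get_local_world_size()}")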

Functions

all_gather

Collects a Tensor from each rank.

all_gather_object

Collects a picklable object from each rank and returns a list of these objects indexed by rank.
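
A minimal sketch, assuming the object to gather is the only required argument (consistent with the summary above):

from composer.utils import dist

# Each rank contributes a small picklable object; every rank receives
# the full list of objects, indexed by global rank.
per_rank_stats = {"rank": dist.get_global_rank(), "num_samples": 128}
gathered = dist.all_gather_object(per_rank_stats)

if dist.get_global_rank() == 0:
    for rank, stats in enumerate(gathered):
        print(f"rank {rank}: {stats}")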

all_reduce

Reduces a tensor by applying the reduce_operation.
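
A hedged sketch of averaging a scalar across ranks, assuming the tensor is reduced in place with a summing default for reduce_operation (check the function's signature for the exact behavior):

import torch

from composer.utils import dist

# Sum the per-rank losses, then divide by the world size to obtain the
# mean across ranks (the in-place sum reduction is an assumption here).
loss = torch.tensor([0.25 * (dist.get_global_rank() + 1)])
dist.all_reduce(loss)
loss /= dist.get_world_size()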

barrier

Synchronizes all processes.
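
A minimal sketch of the usual rank-zero-writes pattern; save_checkpoint is a hypothetical helper used only for illustration:

from composer.utils import dist


def save_checkpoint() -> None:
    """Hypothetical helper used only for illustration; not part of this module."""
    print("checkpoint written by rank 0")


# Only the rank-zero process writes the checkpoint; all other ranks
# wait at the barrier until the write has finished.
if dist.get_global_rank() == 0:
    save_checkpoint()
dist.barrier()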

broadcast

Broadcasts the tensor to the whole group.

broadcast_object_list

Broadcasts picklable objects in object_list to the whole group.
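
A sketch assuming the same in-place semantics and src keyword as torch.distributed.broadcast_object_list (an assumption; the summary above only states that the objects are broadcast to the whole group):

from composer.utils import dist

# Rank 0 decides the run configuration; every other rank receives it.
if dist.get_global_rank() == 0:
    config = [{"lr": 1e-3, "batch_size": 64}]
else:
    config = [None]
dist.broadcast_object_list(config, src=0)
print(config[0])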

get_global_rank

Returns the global rank of the current process, which is in the range [0, WORLD_SIZE - 1].

get_local_rank

Returns the local rank for the current process, which is in the range [0, LOCAL_WORLD_SIZE - 1].

get_local_world_size

Returns the local world size, which is the number of processes for the current node.

get_node_rank

Returns the node rank.

get_sampler

Constructs a DistributedSampler for a dataset.
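
A minimal sketch of wiring the sampler into a DataLoader; the shuffle keyword is an assumption, so check the function's signature for the exact parameters:

import torch
from torch.utils.data import DataLoader, TensorDataset

from composer.utils import dist

# Build a DistributedSampler so each rank draws a distinct shard of the
# dataset, then pass the sampler to the DataLoader.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
sampler = dist.get_sampler(dataset, shuffle=True)  # shuffle is an assumed keyword
dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)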

get_world_size

Returns the world size, which is the number of processes participating in this training run.

initialize_dist

Initialize the default PyTorch distributed process group.

is_available

Returns whether PyTorch was built with distributed support.

is_initialized

Returns whether PyTorch distributed is initialized.
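
These two checks are commonly combined to guard collective calls; a minimal sketch:

from composer.utils import dist

# Only issue collective calls when PyTorch was built with distributed
# support and the default process group has been initialized.
if dist.is_available() and dist.is_initialized():
    dist.barrier()
else:
    print("No initialized process group; skipping collective calls.")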