composer.utils.dist#

Helper methods for torch.distributed.

To use torch.distributed, launch your training script with the composer launcher for distributed training. For example, the following command launches an eight-process training run.

composer -n 8 path/to/train.py

The composer launcher will automatically configure the following environment variables, which are required for distributed training:

  • RANK: The global rank of the process, which is in the range [0, WORLD_SIZE - 1].

  • LOCAL_RANK: The local rank of the process, which is in the range [0, LOCAL_WORLD_SIZE - 1].

  • NODE_RANK: The rank of the node.

  • WORLD_SIZE: The total number of processes.

  • LOCAL_WORLD_SIZE: The number of processes on the current node.

  • MASTER_ADDR: The hostname for the rank-zero process.

  • MASTER_PORT: The port for the rank-zero process.

If none of these environment variables are set, this module will safely assume a single-rank configuration, where:

RANK=0
LOCAL_RANK=0
NODE_RANK=0
WORLD_SIZE=1
LOCAL_WORLD_SIZE=1
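
You can verify this configuration at runtime with the rank helpers documented below. The following is a minimal sketch that only calls getters listed in this module:

from composer.utils import dist

# Print the distributed configuration as seen by the current process.
# Without the launcher-set environment variables, these calls fall back
# to the single-rank defaults shown above.
print(f"global rank:      {dist.get_global_rank()}")
print(f"local rank:       {dist.get_local_rank()}")
print(f"node rank:        {dist.get_node_rank()}")
print(f"world size:       {dist.get_world_size()}")
print(f"local world size: {dist.get_local_world_size()}")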

Functions

all_gather

Collects a Tensor from each rank.

all_gather_object

Collects a picklable object from each rank and returns a list of these objects indexed by rank.
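
A minimal sketch, assuming the object to gather is the only required argument (consistent with the summary above):

from composer.utils import dist

# Each rank contributes a small picklable object; every rank receives
# the full list of objects, indexed by global rank.
per_rank_stats = {"rank": dist.get_global_rank(), "num_samples": 128}
gathered = dist.all_gather_object(per_rank_stats)

if dist.get_global_rank() == 0:
    for rank, stats in enumerate(gathered):
        print(f"rank {rank}: {stats}")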

all_reduce

Reduces a tensor by applying the reduce_operation.
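
A hedged sketch of averaging a scalar across ranks, assuming the tensor is reduced in place with a summing default for reduce_operation (check the function's signature for the exact behavior):

import torch

from composer.utils import dist

# Sum the per-rank losses, then divide by the world size to obtain the
# mean across ranks (the in-place sum reduction is an assumption here).
loss = torch.tensor([0.25 * (dist.get_global_rank() + 1)])
dist.all_reduce(loss)
loss /= dist.get_world_size()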

barrier

Synchronizes all processes.
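
A minimal sketch of the usual rank-zero-writes pattern; save_checkpoint is a hypothetical helper used only for illustration:

from composer.utils import dist


def save_checkpoint() -> None:
    """Hypothetical helper used only for illustration; not part of this module."""
    print("checkpoint written by rank 0")


# Only the rank-zero process writes the checkpoint; all other ranks
# wait at the barrier until the write has finished.
if dist.get_global_rank() == 0:
    save_checkpoint()
dist.barrier()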

broadcast

Broadcasts the tensor to the whole group.

broadcast_object_list

Broadcasts picklable objects in object_list to the whole group.
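
A sketch assuming the same in-place semantics and src keyword as torch.distributed.broadcast_object_list (an assumption; the summary above only states that the objects are broadcast to the whole group):

from composer.utils import dist

# Rank 0 decides the run configuration; every other rank receives it.
if dist.get_global_rank() == 0:
    config = [{"lr": 1e-3, "batch_size": 64}]
else:
    config = [None]
dist.broadcast_object_list(config, src=0)
print(config[0])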

get_global_rank

Returns the global rank of the current process, which is in the range [0, WORLD_SIZE - 1].

get_local_rank

Returns the local rank for the current process, which is in the range [0, LOCAL_WORLD_SIZE - 1].

get_local_world_size

Returns the local world size, which is the number of processes for the current node.

get_node_rank

Returns the node rank.

get_sampler

Constructs a DistributedSampler for a dataset.
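
A minimal sketch of wiring the sampler into a DataLoader; the shuffle keyword is an assumption, so check the function's signature for the exact parameters:

import torch
from torch.utils.data import DataLoader, TensorDataset

from composer.utils import dist

# Build a DistributedSampler so each rank draws a distinct shard of the
# dataset, then pass the sampler to the DataLoader.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
sampler = dist.get_sampler(dataset, shuffle=True)  # shuffle is an assumed keyword
dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)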

get_world_size

Returns the world size, which is the number of processes participating in this training run.

initialize_dist

Initialize the default PyTorch distributed process group.

is_available

Returns whether PyTorch was built with distributed support.

is_initialized

Returns whether PyTorch distributed is initialized.
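
These two checks are commonly combined to guard collective calls; a minimal sketch:

from composer.utils import dist

# Only issue collective calls when PyTorch was built with distributed
# support and the default process group has been initialized.
if dist.is_available() and dist.is_initialized():
    dist.barrier()
else:
    print("No initialized process group; skipping collective calls.")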