composer.utils.dist
Helper methods for torch.distributed.
To use torch.distributed, launch your training script with the composer launcher for distributed training. For example, the following command launches an eight-process training run:
    composer -n 8 path/to/train.py
The composer launcher will automatically configure the following environment variables, which are required for distributed training:
RANK: The global rank of the process, which should be in [0, WORLD_SIZE - 1].
LOCAL_RANK: The local rank for the process, which should be in [0, LOCAL_WORLD_SIZE - 1].
NODE_RANK: The rank of the node.
WORLD_SIZE: The total number of processes.
LOCAL_WORLD_SIZE: The number of processes on the current node.
MASTER_ADDR: The hostname for the rank-zero process.
MASTER_PORT: The port for the rank-zero process.
If none of these environment variables are set, this module will safely assume a single-rank configuration, where:
RANK=0
LOCAL_RANK=0
NODE_RANK=0
WORLD_SIZE=1
LOCAL_WORLD_SIZE=1
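For illustration, the sketch below (assuming the helpers are imported from composer.utils as dist) shows this single-rank fallback; the commented values mirror the defaults listed above:

    import os

    from composer.utils import dist

    # When the script is run directly (e.g., `python path/to/train.py`) and the
    # launcher's environment variables are not set, the helpers fall back to the
    # single-rank defaults listed above.
    print(os.environ.get('RANK'))       # None -- the composer launcher was not used
    print(dist.get_global_rank())       # 0
    print(dist.get_local_rank())        # 0
    print(dist.get_node_rank())         # 0
    print(dist.get_world_size())        # 1
    print(dist.get_local_world_size())  # 1

    # Under `composer -n 8 path/to/train.py`, the same calls reflect the
    # environment variables configured by the launcher instead.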
Functions
all_gather: Collects a tensor from each rank.
all_gather_object: Collects a pickleable object from each rank and returns a list of these objects indexed by rank.
all_reduce: Reduces a tensor across all ranks by applying the specified reduce operation.
barrier: Synchronizes all processes.
broadcast: Broadcasts the tensor to the whole group.
broadcast_object_list: Broadcasts the picklable objects in the given object list to the whole group.
get_global_rank: Returns the global rank of the current process, which is in [0, WORLD_SIZE - 1].
get_local_rank: Returns the local rank for the current process, which is in [0, LOCAL_WORLD_SIZE - 1].
get_local_world_size: Returns the local world size, which is the number of processes for the current node.
get_node_rank: Returns the node rank.
get_sampler: Constructs a DistributedSampler for the dataset.
get_world_size: Returns the world size, which is the number of processes participating in this training run.
initialize_dist: Initializes the default PyTorch distributed process group.
is_available: Returns whether PyTorch was built with distributed support.
is_initialized: Returns whether PyTorch distributed is initialized.
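As a hedged illustration of how several of these helpers compose inside a script launched with the composer launcher, the sketch below sums a per-rank tensor, gathers a picklable object from every rank, and synchronizes before rank zero prints. The in-place, sum-by-default behavior of all_reduce is an assumption carried over from torch.distributed, not a guarantee of this API:

    import torch

    from composer.utils import dist

    # Combine a per-rank value across all processes. This sketch assumes
    # all_reduce modifies the tensor in place and sums by default, mirroring
    # torch.distributed.all_reduce.
    local_metric = torch.tensor([float(dist.get_global_rank())])
    dist.all_reduce(local_metric)

    # Gather an arbitrary picklable object from every rank; the result is a
    # list of these objects indexed by rank.
    gathered = dist.all_gather_object({'rank': dist.get_global_rank()})

    # Synchronize all processes before rank zero reports results.
    dist.barrier()
    if dist.get_global_rank() == 0:
        print(f'world size: {dist.get_world_size()}')
        print(f'summed metric: {local_metric.item()}')
        print(gathered)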