composer.utils.dist
Helper methods for torch.distributed.

To use torch.distributed, launch your training script with the composer launcher for distributed training. For example, the following command launches an eight-process training run:
composer -n 8 path/to/train.py
The composer launcher will automatically configure the following environment variables, which are required for distributed training:
RANK: The global rank of the process, which should be in [0; WORLD_SIZE - 1].
LOCAL_RANK: The local rank of the process, which should be in [0; LOCAL_WORLD_SIZE - 1].
NODE_RANK: The rank of the node.
WORLD_SIZE: The total number of processes.
LOCAL_WORLD_SIZE: The number of processes on the current node.
MASTER_ADDR: The hostname for the rank-zero process.
MASTER_PORT: The port for the rank-zero process.
If none of these environment variables are set, this module will safely assume a single-rank configuration, where:
RANK=0
LOCAL_RANK=0
NODE_RANK=0
WORLD_SIZE=1
LOCAL_WORLD_SIZE=1
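As a rough sketch (the script name check_dist.py is hypothetical; the accessor functions are listed under Functions below), the following prints the values these variables resolve to. Run directly, it reports the single-rank defaults; launched with, for example, composer -n 2 check_dist.py, each rank reports its own values:

# check_dist.py -- hypothetical example; prints what the dist helpers report.
from composer.utils import dist

# Each accessor reads the corresponding environment variable set by the
# composer launcher, or falls back to the single-rank default listed above.
print(f"global rank:      {dist.get_global_rank()}")       # RANK, default 0
print(f"local rank:       {dist.get_local_rank()}")        # LOCAL_RANK, default 0
print(f"node rank:        {dist.get_node_rank()}")         # NODE_RANK, default 0
print(f"world size:       {dist.get_world_size()}")        # WORLD_SIZE, default 1
print(f"local world size: {dist.get_local_world_size()}")  # LOCAL_WORLD_SIZE, default 1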
Functions
all_gather: Collects a Tensor from each rank.
all_gather_object: Collects a pickleable object from each rank and returns a list of these objects indexed by rank.
all_reduce: Reduces a Tensor by applying the reduce_operation across all ranks.
barrier: Synchronizes all processes.
broadcast: Broadcasts the tensor to the whole group.
broadcast_object_list: Broadcasts picklable objects in object_list to the whole group.
get_global_rank: Returns the global rank of the current process, which is in [0; WORLD_SIZE - 1].
get_local_rank: Returns the local rank for the current process, which is in [0; LOCAL_WORLD_SIZE - 1].
get_local_world_size: Returns the local world size, which is the number of processes for the current node.
get_node_rank: Returns the node rank.
get_sampler: Constructs a DistributedSampler for the given dataset.
get_world_size: Returns the world size, which is the number of processes participating in this training run.
initialize_dist: Initializes the default PyTorch distributed process group.
is_available: Returns whether PyTorch was built with distributed support.
is_initialized: Returns whether PyTorch distributed is initialized.
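For orientation, a minimal usage sketch follows. It assumes a recent Composer release (initialize_dist accepting a device string and all_reduce accepting a reduce_operation keyword), a CPU process group, and that the script is launched with the composer launcher; it is an illustration, not an excerpt from the library's documentation:

# Hypothetical sketch: gather and reduce a tensor across all ranks.
import torch
from composer.utils import dist

# Create the default process group; the device string selects the backend
# (the Composer Trainer typically handles this step itself).
dist.initialize_dist(device="cpu")

# Every rank contributes a tensor filled with its global rank; all_gather
# returns one tensor per rank, indexed by rank.
local_tensor = torch.full((2,), fill_value=float(dist.get_global_rank()))
gathered = dist.all_gather(local_tensor)

# Sum a scalar across ranks in place.
total = torch.tensor([1.0])
dist.all_reduce(total, reduce_operation="SUM")

# Only the rank-zero process prints; everyone then waits at the barrier.
if dist.get_global_rank() == 0:
    print(f"gathered {len(gathered)} tensors; total across ranks = {total.item()}")
dist.barrier()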