composer.utils.dist#

Helper methods for torch.distributed.

To use torch.distributed, launch your training script with the composer launcher for distributed training. For example, the following command launches an eight-process training run:

composer -n 8 path/to/train.py
The composer launcher will automatically configure the following environment variables, which are required for distributed training:

RANK: The global rank of the process, which is in the range [0, WORLD_SIZE - 1].
LOCAL_RANK: The local rank of the process, which is in the range [0, LOCAL_WORLD_SIZE - 1].
NODE_RANK: The rank of the node.
WORLD_SIZE: The total number of processes.
LOCAL_WORLD_SIZE: The number of processes on the current node.
MASTER_ADDR: The hostname of the rank-zero process.
MASTER_PORT: The port of the rank-zero process.
If none of these environment variables are set, this module will safely assume a single-rank configuration, where:
RANK=0
LOCAL_RANK=0
NODE_RANK=0
WORLD_SIZE=1
LOCAL_WORLD_SIZE=1
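The fallback behavior above can be sketched in a few lines. This is an illustrative helper only (get_dist_config is hypothetical and not part of composer.utils.dist): each variable is read from the environment, defaulting to the single-rank values.

```python
import os

# Single-rank defaults used when no distributed environment is configured.
# NOTE: get_dist_config is a hypothetical helper for illustration; it is
# not part of composer.utils.dist.
_DEFAULTS = {
    "RANK": 0,
    "LOCAL_RANK": 0,
    "NODE_RANK": 0,
    "WORLD_SIZE": 1,
    "LOCAL_WORLD_SIZE": 1,
}

def get_dist_config() -> dict:
    """Read the distributed environment variables, falling back to the
    single-rank defaults when they are unset."""
    return {
        name: int(os.environ.get(name, default))
        for name, default in _DEFAULTS.items()
    }
```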
Functions

all_gather – Collects a Tensor from each rank and returns a sequence of Tensors indexed by rank.
all_gather_object – Collects a pickleable object from each rank and returns a list of these objects indexed by rank.
all_reduce – Reduces a tensor by applying the reduce_operation.
barrier – Synchronizes all processes.
broadcast – Broadcasts the tensor to the whole group.
broadcast_object_list – Broadcasts picklable objects in object_list to the whole group.
get_global_rank – Returns the global rank of the current process, which is in the range [0, WORLD_SIZE - 1].
get_local_rank – Returns the local rank of the current process, which is in the range [0, LOCAL_WORLD_SIZE - 1].
get_local_world_size – Returns the local world size, which is the number of processes on the current node.
get_node_rank – Returns the node rank.
get_sampler – Constructs a DistributedSampler for a dataset.
get_world_size – Returns the world size, which is the number of processes participating in this training run.
initialize_dist – Initializes the default PyTorch distributed process group.
is_available – Returns whether PyTorch was built with distributed support.
is_initialized – Returns whether PyTorch distributed is initialized.
- composer.utils.dist.all_gather(tensor)[source]#
Collects a Tensor from each rank and returns a sequence of Tensors indexed by rank.
See also
torch.distributed.all_gather()
- Parameters
tensor (Tensor) – Tensor from each rank to be gathered.
- Returns
Sequence[Tensor] – A sequence of tensors indexed by rank.
- composer.utils.dist.all_gather_object(obj)[source]#
Collects a pickleable object from each rank and returns a list of these objects indexed by rank.
- Parameters
obj (TObj) – Object to be gathered.
- Returns
List[TObj] – A list of objects indexed by rank.
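The contract of this collective can be illustrated with a single-process sketch (simulated_all_gather_object is hypothetical, not composer's implementation): each rank contributes one object, and every rank receives the full list indexed by rank. The pickle round-trip mirrors the requirement that objects be picklable.

```python
import pickle

# Hypothetical single-process simulation of the all_gather_object
# contract; not composer's implementation.
def simulated_all_gather_object(per_rank_objects):
    # Round-trip each object through pickle, mirroring the
    # requirement that gathered objects be picklable.
    gathered = [pickle.loads(pickle.dumps(obj)) for obj in per_rank_objects]
    # Every rank receives the same list, indexed by rank.
    return [list(gathered) for _ in per_rank_objects]
```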
- composer.utils.dist.all_reduce(tensor, reduce_operation='SUM')[source]#
Reduces a tensor by applying the reduce_operation. All ranks receive the same, bitwise-identical result.
See also
torch.distributed.all_reduce()
- Parameters
tensor (Tensor) – Input and output of the collective. The function operates in-place.
reduce_operation (str, optional) – The reduction operation (default: SUM). Valid options are: SUM, PRODUCT, MIN, MAX, BAND, BOR, BXOR.
- Returns
None – tensor is modified in-place.
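A single-process sketch of what the reduction computes (simulated_all_reduce is illustrative, not composer's implementation, and uses plain lists in place of torch Tensors): each rank contributes a tensor, and every rank ends up with the same element-wise reduction.

```python
from functools import reduce
import operator

# Hypothetical single-process simulation of all_reduce semantics;
# not composer's implementation. The real call operates in-place on
# a torch Tensor across ranks.
def simulated_all_reduce(per_rank_tensors, reduce_operation="SUM"):
    ops = {"SUM": operator.add, "PRODUCT": operator.mul,
           "MIN": min, "MAX": max}
    op = ops[reduce_operation]
    # Element-wise reduction across the per-rank contributions.
    result = [reduce(op, values) for values in zip(*per_rank_tensors)]
    # Every rank receives the same, identical result.
    return [list(result) for _ in per_rank_tensors]
```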
- composer.utils.dist.barrier()[source]#
Synchronizes all processes. This function blocks until all processes reach it.
See also
torch.distributed.barrier()
- composer.utils.dist.broadcast(tensor, src)[source]#
Broadcasts the tensor to the whole group. tensor must have the same number of elements in all processes participating in the collective. See torch.distributed.broadcast().
- composer.utils.dist.broadcast_object_list(object_list, src=0)[source]#
Broadcasts picklable objects in object_list to the whole group. Similar to broadcast(), but Python objects can be passed in. Note that all objects in object_list must be picklable in order to be broadcast.
See also
torch.distributed.broadcast_object_list()
- Parameters
object_list (List[Any]) – List of picklable objects to broadcast.
src (int, optional) – Source rank from which to broadcast (default: 0).
- Returns
None – object_list is modified in-place and set to the values of object_list from the src rank.
- composer.utils.dist.get_global_rank()[source]#
Returns the global rank of the current process, which is in the range [0, WORLD_SIZE - 1].
- Returns
int – The global rank.
- composer.utils.dist.get_local_rank()[source]#
Returns the local rank of the current process, which is in the range [0, LOCAL_WORLD_SIZE - 1].
- Returns
int – The local rank.
- composer.utils.dist.get_local_world_size()[source]#
Returns the local world size, which is the number of processes for the current node.
- Returns
int – The local world size.
- composer.utils.dist.get_node_rank()[source]#
Returns the node rank. For example, if there are 2 nodes, and 2 ranks per node, then global ranks 0-1 will have a node rank of 0, and global ranks 2-3 will have a node rank of 1.
- Returns
int – The node rank, starting at 0.
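The mapping in the example above can be written out directly, assuming global ranks are assigned contiguously within each node (node_rank here is a hypothetical helper, not composer's implementation):

```python
# Hypothetical helper illustrating the documented relationship,
# assuming ranks are assigned contiguously within each node:
# node_rank = global_rank // local_world_size.
def node_rank(global_rank: int, local_world_size: int) -> int:
    return global_rank // local_world_size
```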
- composer.utils.dist.get_sampler(dataset, *, drop_last, shuffle)[source]#
Constructs a DistributedSampler for a dataset. The DistributedSampler assumes that each rank has a complete copy of the dataset. It ensures that each rank sees a unique shard for each epoch containing len(dataset) / get_world_size() samples.
Note
If the dataset is already sharded by rank, use a SequentialSampler or RandomSampler.
- composer.utils.dist.get_world_size()[source]#
Returns the world size, which is the number of processes participating in this training run.
- Returns
int – The world size.
- composer.utils.dist.initialize_dist(backend, timeout)[source]#
Initialize the default PyTorch distributed process group.
This function assumes that the following environment variables are set:

RANK: The global rank of the process, which is in the range [0, WORLD_SIZE - 1].
LOCAL_RANK: The local rank of the process, which is in the range [0, LOCAL_WORLD_SIZE - 1].
NODE_RANK: The rank of the node.
WORLD_SIZE: The total number of processes.
LOCAL_WORLD_SIZE: The number of processes on the current node.
MASTER_ADDR: The hostname of the rank-zero process.
MASTER_PORT: The port of the rank-zero process.

If none of these environment variables are set, this function assumes a single-rank configuration and initializes the default process group using a torch.distributed.HashStore store.
- composer.utils.dist.is_available()[source]#
Returns whether PyTorch was built with distributed support.
See also
torch.distributed.is_available()
- Returns
bool – Whether PyTorch distributed support is available.