composer.utils.dist#
Helper methods for torch.distributed.
To use torch.distributed, launch your training script with the
composer launcher for distributed training. For example,
the following command launches an eight-process training run.
composer -n 8 path/to/train.py
The composer launcher will automatically configure the following environment variables, which are required for distributed training:
- RANK: The global rank of the process, which should be in [0, WORLD_SIZE - 1].
- LOCAL_RANK: The local rank for the process, which should be in [0, LOCAL_WORLD_SIZE - 1].
- NODE_RANK: The rank of the node.
- WORLD_SIZE: The total number of processes.
- LOCAL_WORLD_SIZE: The number of processes on the current node.
- MASTER_ADDR: The hostname for the rank-zero process.
- MASTER_PORT: The port for the rank-zero process.
If none of these environment variables are set, this module will safely assume a single-rank configuration, where:
RANK=0
LOCAL_RANK=0
NODE_RANK=0
WORLD_SIZE=1
LOCAL_WORLD_SIZE=1
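For illustration (this example is not part of the upstream reference), the sketch below shows the values the rank helpers report when a script is run directly with python, so that none of the variables above are set:

```python
# Minimal sketch: run directly with `python train.py` (no composer launcher),
# so none of the RANK / WORLD_SIZE environment variables are set.
from composer.utils import dist

print(dist.get_global_rank())       # 0
print(dist.get_local_rank())        # 0
print(dist.get_node_rank())         # 0
print(dist.get_world_size())        # 1
print(dist.get_local_world_size())  # 1
```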
Functions
- all_gather: Collects a Tensor from each rank and returns a sequence of Tensors indexed by rank.
- all_gather_object: Collects a pickleable object from each rank and returns a list of these objects indexed by rank.
- all_reduce: Reduces a tensor by applying the reduce_operation.
- barrier: Synchronizes all processes.
- broadcast: Broadcasts the tensor to the whole group.
- broadcast_object_list: Broadcasts picklable objects in object_list to the whole group.
- get_global_rank: Returns the global rank of the current process, which is in [0, WORLD_SIZE - 1].
- get_local_rank: Returns the local rank for the current process, which is in [0, LOCAL_WORLD_SIZE - 1].
- get_local_world_size: Returns the local world size, which is the number of processes for the current node.
- get_node_rank: Returns the node rank.
- get_sampler: Constructs a DistributedSampler for a dataset.
- get_world_size: Returns the world size, which is the number of processes participating in this training run.
- initialize_dist: Initializes the default PyTorch distributed process group.
- is_available: Returns whether PyTorch was built with distributed support.
- is_initialized: Returns whether PyTorch distributed is initialized.
- composer.utils.dist.all_gather(tensor)[source]#
Collects a Tensor from each rank and returns a sequence of Tensors indexed by rank.
See also: torch.distributed.all_gather().
- Parameters
tensor (Tensor) – Tensor from each rank to be gathered.
- Returns
Sequence[Tensor] – A sequence of tensors indexed by rank.
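A minimal sketch (not from the upstream docs), assuming the default process group has already been initialized with a CPU-friendly gloo backend, for example via initialize_dist():

```python
import torch
from composer.utils import dist

# Each rank contributes a tensor holding its own global rank.
local_tensor = torch.tensor([dist.get_global_rank()])
gathered = dist.all_gather(local_tensor)

# gathered[i] is the tensor contributed by rank i.
# With WORLD_SIZE=4, every rank sees: [0, 1, 2, 3]
print([int(t.item()) for t in gathered])
```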
- composer.utils.dist.all_gather_object(obj)[source]#
Collect a pickleable object from each rank and return a list of these objects indexed by rank.
- Parameters
obj (TObj) – Object to be gathered.
- Returns
List[TObj] – A list of objects indexed by rank.
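A minimal sketch (not from the upstream docs) that gathers a small picklable dict from every rank; the "num_samples" field is a hypothetical per-rank statistic:

```python
from composer.utils import dist

# Hypothetical per-rank metadata; any picklable object works.
stats = {"rank": dist.get_global_rank(), "num_samples": 128}
all_stats = dist.all_gather_object(stats)

# all_stats[i] is the dict contributed by rank i.
total_samples = sum(s["num_samples"] for s in all_stats)
```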
- composer.utils.dist.all_reduce(tensor, reduce_operation='SUM')[source]#
Reduce a tensor by applying the reduce_operation. All ranks get the same, bitwise-identical result.
See also: torch.distributed.all_reduce().
- Parameters
tensor (Tensor) – Input and output of the collective. The function operates in-place.
reduce_operation (str, optional) – The reduction operation to apply element-wise, named after a value of the torch.distributed.ReduceOp enum (default: SUM). Valid options are: SUM, PRODUCT, MIN, MAX, BAND, BOR, and BXOR.
- Returns
None – tensor is modified in-place.
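A minimal sketch (not from the upstream docs), assuming an initialized process group, that sums a per-rank value in-place and derives the mean:

```python
import torch
from composer.utils import dist

loss = torch.tensor([1.5])                     # hypothetical per-rank value
dist.all_reduce(loss, reduce_operation="SUM")  # in-place; same result on every rank

mean_loss = loss / dist.get_world_size()
```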
- composer.utils.dist.barrier()[source]#
Synchronizes all processes.
This function blocks until all processes reach this function.
See also: torch.distributed.barrier().
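A minimal sketch (not from the upstream docs), assuming an initialized process group and a filesystem shared by all ranks:

```python
import pathlib
from composer.utils import dist

path = pathlib.Path("shared_config.txt")  # hypothetical shared path
if dist.get_global_rank() == 0:
    path.write_text("lr=0.1")             # only rank 0 writes the file
dist.barrier()                            # blocks until every rank reaches this line
config_text = path.read_text()            # now safe to read on every rank
```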
- composer.utils.dist.broadcast(tensor, src)[source]#
Broadcasts the tensor to the whole group.
tensor must have the same number of elements in all processes participating in the collective. See torch.distributed.broadcast().
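A minimal sketch (not from the upstream docs), assuming an initialized gloo process group, where rank 0 chooses a value and every other rank receives it:

```python
import torch
from composer.utils import dist

seed = torch.zeros(1, dtype=torch.int64)
if dist.get_global_rank() == 0:
    seed.fill_(42)            # only the source rank sets the value
dist.broadcast(seed, src=0)
# After the call, seed holds 42 on every rank.
```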
- composer.utils.dist.broadcast_object_list(object_list, src=0)[source]#
Broadcasts picklable objects in object_list to the whole group. Similar to broadcast(), but Python objects can be passed in. Note that all objects in object_list must be picklable in order to be broadcast.
See also: torch.distributed.broadcast_object_list().
- Parameters
object_list – List of input objects to broadcast. Each object must be picklable.
src (int, optional) – Source rank (default: 0).
- Returns
None – object_list is modified in-place and set to the values of object_list from the src rank.
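A minimal sketch (not from the upstream docs), assuming an initialized process group; the config dict is hypothetical:

```python
from composer.utils import dist

if dist.get_global_rank() == 0:
    obj_list = [{"lr": 0.1, "run_name": "baseline"}]  # hypothetical config on the source rank
else:
    obj_list = [None]                                 # placeholder, overwritten in-place
dist.broadcast_object_list(obj_list, src=0)

config = obj_list[0]  # identical dict on every rank
```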
- composer.utils.dist.get_global_rank()[source]#
Returns the global rank of the current process, which is in [0, WORLD_SIZE - 1].
- Returns
int – The global rank.
- composer.utils.dist.get_local_rank()[source]#
Returns the local rank for the current process, which is in [0, LOCAL_WORLD_SIZE - 1].
- Returns
int – The local rank.
- composer.utils.dist.get_local_world_size()[source]#
Returns the local world size, which is the number of processes for the current node.
- Returns
int – The local world size.
- composer.utils.dist.get_node_rank()[source]#
Returns the node rank. For example, if there are 2 nodes, and 2 ranks per node, then global ranks 0-1 will have a node rank of 0, and global ranks 2-3 will have a node rank of 1.
- Returns
int – The node rank, starting at 0.
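For illustration (not from the upstream docs), the sketch below lists the values these helpers report on global rank 3 of the 2-node, 2-ranks-per-node layout described above:

```python
from composer.utils import dist

dist.get_global_rank()       # 3
dist.get_local_rank()        # 1  (second process on its node)
dist.get_node_rank()         # 1  (second node)
dist.get_local_world_size()  # 2
dist.get_world_size()        # 4
```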
- composer.utils.dist.get_sampler(dataset, *, drop_last, shuffle)[source]#
Constructs a DistributedSampler for a dataset. The DistributedSampler assumes that each rank has a complete copy of the dataset. It ensures that each rank sees a unique shard for each epoch containing len(dataset) / get_world_size() samples.
Note
If the dataset is already sharded by rank, use a SequentialSampler or RandomSampler.
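A minimal sketch (not from the upstream docs) that wraps a small in-memory dataset, of which every rank holds a complete copy, in a per-rank DataLoader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer.utils import dist

dataset = TensorDataset(torch.arange(1000).float())
sampler = dist.get_sampler(dataset, drop_last=False, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Each rank iterates over roughly len(dataset) / get_world_size() samples per epoch.
```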
- composer.utils.dist.get_world_size()[source]#
Returns the world size, which is the number of processes participating in this training run.
- Returns
int – The world size.
- composer.utils.dist.initialize_dist(backend, timeout)[source]#
Initialize the default PyTorch distributed process group.
This function assumes that the following environment variables are set:
- RANK: The global rank of the process, which should be in [0, WORLD_SIZE - 1].
- LOCAL_RANK: The local rank for the process, which should be in [0, LOCAL_WORLD_SIZE - 1].
- NODE_RANK: The rank of the node.
- WORLD_SIZE: The total number of processes.
- LOCAL_WORLD_SIZE: The number of processes on the current node.
- MASTER_ADDR: The hostname for the rank-zero process.
- MASTER_PORT: The port for the rank-zero process.
If none of the environment variables are set, this function will assume a single-rank configuration and initialize the default process group using a
torch.distributed.HashStore store.
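A minimal sketch (not from the upstream docs) of initializing the process group by hand, choosing the backend from the available hardware; the 300-second timeout is an arbitrary example value:

```python
import datetime
import torch
from composer.utils import dist

if dist.get_world_size() > 1 and not dist.is_initialized():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.initialize_dist(backend, datetime.timedelta(seconds=300))
```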
- composer.utils.dist.is_available()[source]#
Returns whether PyTorch was built with distributed support.
See also: torch.distributed.is_available().
- Returns
bool – Whether PyTorch distributed support is available.
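A minimal sketch (not from the upstream docs) of guarding collective calls so the same script also runs on a PyTorch build without distributed support or in a single process:

```python
from composer.utils import dist

if dist.is_available() and dist.get_world_size() > 1:
    dist.barrier()  # only meaningful when multiple ranks exist
```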