# composer.trainer.ddp

Helpers for running distributed data parallel training.

Functions

 ddp_sync_context A context manager for handling the DDPSyncStrategy.
 prepare_ddp_module Wraps the module in a torch.nn.parallel.DistributedDataParallel object if running distributed training.

Classes

 DDPSyncStrategy How and when DDP gradient synchronization should happen.

class composer.trainer.ddp.DDPSyncStrategy(value)

How and when DDP gradient synchronization should happen.

SINGLE_AUTO_SYNC

The default behavior for DDP. Gradients are synchronized as they are computed, but only for the final microbatch of a batch. This is the most efficient strategy, but it can lead to errors when find_unused_parameters is set, since different microbatches may use different sets of parameters, leading to an incomplete sync.

MULTI_AUTO_SYNC

The default behavior for DDP when find_unused_parameters is set. Gradients are synchronized as they are computed for all microbatches. This ensures complete synchronization, but is less efficient than SINGLE_AUTO_SYNC. This efficiency gap is usually small, as long as either DDP syncs are a small portion of the trainer’s overall runtime, or the number of microbatches per batch is relatively small.

FORCED_SYNC

Gradients are manually synchronized only after all gradients have been computed for the final microbatch of a batch. Like MULTI_AUTO_SYNC, this strategy ensures complete gradient synchronization, but it tends to be slower than MULTI_AUTO_SYNC. This is because syncs can ordinarily happen in parallel with the loss.backward() computation, so they can be mostly complete by the time that function finishes. However, when syncs take a very long time to complete and there are also many microbatches per batch, this strategy may be optimal, since it synchronizes only once per batch.
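The three strategies differ mainly in when gradient communication happens within a batch. The following is a minimal sketch of that schedule in plain Python, not Composer's implementation: `SyncStrategy`, its string values, and `sync_schedule` are hypothetical names introduced only for illustration.

```python
from enum import Enum


class SyncStrategy(Enum):
    """Hypothetical stand-in mirroring the three DDPSyncStrategy members."""

    SINGLE_AUTO_SYNC = "single_auto_sync"
    MULTI_AUTO_SYNC = "multi_auto_sync"
    FORCED_SYNC = "forced_sync"


def sync_schedule(strategy, num_microbatches):
    """Return one label per microbatch describing how gradients are handled.

    "auto"   : gradients are synchronized during backward, overlapping with it
    "none"   : gradients only accumulate locally, with no communication
    "manual" : gradients accumulate locally, then are synchronized after backward
    """
    schedule = []
    for i in range(num_microbatches):
        is_final = i == num_microbatches - 1
        if strategy is SyncStrategy.MULTI_AUTO_SYNC:
            schedule.append("auto")
        elif strategy is SyncStrategy.SINGLE_AUTO_SYNC:
            schedule.append("auto" if is_final else "none")
        else:  # SyncStrategy.FORCED_SYNC
            schedule.append("manual" if is_final else "none")
    return schedule


for strategy in SyncStrategy:
    print(strategy.name, sync_schedule(strategy, 4))
```

With four microbatches, the sketch shows one overlapped sync for SINGLE_AUTO_SYNC, four overlapped syncs for MULTI_AUTO_SYNC, and one non-overlapped sync for FORCED_SYNC, which matches the efficiency trade-offs described above.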

composer.trainer.ddp.ddp_sync_context(state, is_final_microbatch, sync_strategy)

A context manager for handling the DDPSyncStrategy.
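As a rough illustration of how such a context manager could behave, the sketch below selects a synchronization mode for a single microbatch. It is an assumption-laden approximation rather than the library's code: `sync_context_sketch`, the strategy strings, the `force_sync` callback, and the `FakeDDP` stand-in are all hypothetical, and the sketch takes a model directly instead of a `state` object. The only real API it relies on is the `no_sync()` context manager that torch.nn.parallel.DistributedDataParallel provides.

```python
from contextlib import contextmanager, nullcontext


@contextmanager
def sync_context_sketch(model, is_final_microbatch, strategy, force_sync=None):
    """Illustrative only: pick a gradient-sync mode for one microbatch.

    `model` is assumed to expose `no_sync()`, as DistributedDataParallel does.
    `force_sync` is a hypothetical callback that all-reduces gradients manually.
    """
    if strategy == "multi_auto_sync":
        # Sync on every microbatch: never suppress communication.
        with nullcontext():
            yield
    elif strategy == "single_auto_sync":
        # Suppress communication except on the final microbatch.
        with (nullcontext() if is_final_microbatch else model.no_sync()):
            yield
    elif strategy == "forced_sync":
        # Always suppress automatic syncs, then sync by hand at the end.
        with model.no_sync():
            yield
        if is_final_microbatch and force_sync is not None:
            force_sync()
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")


class FakeDDP:
    """Stand-in that records events so the sketch runs without a process group."""

    def __init__(self):
        self.events = []

    @contextmanager
    def no_sync(self):
        self.events.append("enter no_sync")
        try:
            yield
        finally:
            self.events.append("exit no_sync")


model = FakeDDP()
num_microbatches = 3
for i in range(num_microbatches):
    is_final = i == num_microbatches - 1
    with sync_context_sketch(
        model,
        is_final,
        "forced_sync",
        force_sync=lambda: model.events.append("manual all-reduce"),
    ):
        model.events.append(f"backward {i}")  # placeholder for loss.backward()

print(model.events)
```

In a real training loop, the body of the `with` block would contain the forward pass and `loss.backward()` for one microbatch, and `model` would be the DDP-wrapped module.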

Parameters

composer.trainer.ddp.prepare_ddp_module(module, find_unused_parameters)

Wraps the module in a torch.nn.parallel.DistributedDataParallel object if running distributed training.

Parameters
• module (Module) – The module to wrap.

• find_unused_parameters (bool) – Whether to traverse the autograd graph to find parameters for which no gradients should be expected. This is useful if some parameters in the model are not being trained.
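The wrapping step can be sketched with plain PyTorch as follows. This is a hedged approximation, not the body of `prepare_ddp_module`: the helper name `wrap_if_distributed` is hypothetical, and returning the module unchanged when no process group is initialized is an assumption (the actual function may handle that case differently).

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def wrap_if_distributed(module: torch.nn.Module, find_unused_parameters: bool) -> torch.nn.Module:
    """Hypothetical helper: wrap `module` in DDP only when a process group exists."""
    if dist.is_available() and dist.is_initialized():
        return DistributedDataParallel(
            module,
            find_unused_parameters=find_unused_parameters,
        )
    # Assumption: outside distributed training the module is returned as-is.
    return module


# In a single, non-distributed process no wrapping takes place.
model = wrap_if_distributed(torch.nn.Linear(4, 2), find_unused_parameters=False)
print(type(model).__name__)
```

When launched under an initialized process group, the same call would return a `DistributedDataParallel` instance whose `no_sync()` method can then be used to control gradient synchronization per microbatch.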