composer.trainer.ddp
Helpers for running distributed data parallel training.
Classes

| `DDPSyncStrategy` | How and when DDP gradient synchronization should happen. |
- class composer.trainer.ddp.DDPSyncStrategy(value)[source]
Bases: `composer.utils.string_enum.StringEnum`
How and when DDP gradient synchronization should happen.
- SINGLE_AUTO_SYNC
The default behavior for DDP. Gradients are synchronized as they are computed, for only the final microbatch of a batch. This is the most efficient strategy, but can lead to errors when `find_unused_parameters` is set, since different microbatches may use different sets of parameters, leading to an incomplete sync. (A sketch of all three strategies in terms of raw DDP calls appears after this list.)
- MULTI_AUTO_SYNC
The default behavior for DDP when `find_unused_parameters` is set. Gradients are synchronized as they are computed for all microbatches. This ensures complete synchronization, but is less efficient than `SINGLE_AUTO_SYNC`. The efficiency gap is usually small, as long as either the DDP syncs are a small portion of the trainer's overall runtime or the number of microbatches per batch is relatively small.
- FORCED_SYNC
Gradients are manually synchronized only after all gradients have been computed for the final microbatch of a batch. Like `MULTI_AUTO_SYNC`, this strategy ensures complete gradient synchronization, but it tends to be slower than `MULTI_AUTO_SYNC`. This is because syncs can ordinarily happen in parallel with the `loss.backward()` computation, so they are mostly complete by the time that function finishes; `FORCED_SYNC` gives up that overlap. However, when syncs take a very long time to complete and there are many microbatches per batch, paying for a single sync at the end instead of one per microbatch can make this strategy optimal.
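To make the trade-offs concrete, here is a minimal sketch of how each strategy maps onto raw PyTorch DDP calls. This is illustrative only, not Composer's implementation: `model` is assumed to be a `torch.nn.parallel.DistributedDataParallel` module, and `microbatches` (a list of input/target pairs) and `loss_fn` are hypothetical placeholders.

```python
import contextlib

import torch.distributed as dist


def single_auto_sync(model, microbatches, loss_fn):
    # Suppress DDP's automatic gradient sync for all but the last
    # microbatch; the final backward() then syncs the accumulated
    # gradients once, overlapped with the backward computation.
    for i, (x, y) in enumerate(microbatches):
        is_last = i == len(microbatches) - 1
        ctx = contextlib.nullcontext() if is_last else model.no_sync()
        with ctx:
            loss_fn(model(x), y).backward()


def multi_auto_sync(model, microbatches, loss_fn):
    # Let DDP sync automatically after every microbatch's backward().
    # Complete even with find_unused_parameters set, but the
    # all-reduce is repeated once per microbatch.
    for x, y in microbatches:
        loss_fn(model(x), y).backward()


def forced_sync(model, microbatches, loss_fn):
    # Compute all gradients with automatic sync disabled, then
    # all-reduce each gradient manually. One sync total, but no
    # overlap with the backward computation.
    with model.no_sync():
        for x, y in microbatches:
            loss_fn(model(x), y).backward()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad)
            p.grad /= dist.get_world_size()
```

The `forced_sync` variant shows why it can still win in the slow-sync case: it pays for exactly one all-reduce per parameter, regardless of how many microbatches there are.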
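Since `DDPSyncStrategy` derives from `StringEnum`, the strategy can typically be given either as an enum member or as its string value. A minimal sketch, assuming the members use the conventional lowercase string values; check your Composer version for the exact strings and for where the strategy is consumed (for example, a trainer argument):

```python
from composer.trainer.ddp import DDPSyncStrategy

# Enum member form.
strategy = DDPSyncStrategy.SINGLE_AUTO_SYNC

# String form; "single_auto_sync" is an assumed value, not verified
# against every Composer release.
assert DDPSyncStrategy("single_auto_sync") is strategy
```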