# 🔢 Numerics

Composer supports several single- and half-precision number formats, including IEEE single-precision fp32, mixed precision training with IEEE half-precision fp16 (amp_fp16), and mixed precision training with Google's truncated bf16 format (amp_bf16). The following formats are supported per accelerator:

| Format                 | CPU | GPU - A100 | GPU - V100 | GPU - T4 |
|------------------------|-----|------------|------------|----------|
| amp_fp16 (GPU default) | ✗   | ✓          | ✓          | ✓        |
| amp_bf16               | ✗   | ✓          | ✗          | ✗        |
| fp32 (CPU default)     | ✓   | ✓          | ✓          | ✓        |

When using the Trainer, the number format can be selected by specifying the precision argument during initialization. In the example below, we are training on a gpu device using Automatic Mixed Precision (amp_fp16):

```python
from composer import Trainer

# use mixed precision during training
trainer = Trainer(
    model=model,
    max_duration='160ep',
    device='gpu',
    precision='amp_fp16',
)
```


> **Note:** If the precision argument is not specified, the Trainer defaults to using amp_fp16 on GPU and fp32 on CPU.

## Precision

When discussing number formats, precision generally refers to the number of bits used to represent each number (e.g., IEEE Single-precision Floating-point (fp32) represents numbers using 32 bits of computer memory). Higher precision formats can represent a greater range of numbers and more decimal places than lower precision formats. However, lower precision formats require less memory and, on many accelerators, enable greater compute throughput. It is therefore often advantageous to use lower precision formats to accelerate training.

For training, the most commonly used low precision formats are the IEEE Half-precision Floating-point (fp16) and Brain Floating-point (bf16) formats. While both formats occupy the same number of bits in memory, bf16 offers a greater numerical range than fp16 at the expense of representing fewer significant decimal digits.
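This trade-off is easy to see with PyTorch's torch.finfo, which reports each format's representable range (max) and precision (eps, the gap between 1.0 and the next representable value). A minimal sketch:

```python
import torch

# Compare the dynamic range and precision of fp32, fp16, and bf16.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # max: largest representable value (range); eps: smallest step away
    # from 1.0 the format can resolve (precision).
    print(f'{str(dtype):16s} max={info.max:.2e}  eps={info.2e}'.replace('info.2e', f'{info.eps:.2e}'))
```

bf16 reports roughly the same max as fp32 (about 3.4e38) but a much larger eps, while fp16's max is only 65504 with an eps in between the other two.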

## Automatic Mixed Precision (AMP) Training

Using half-precision number formats can boost model throughput, though not necessarily for free. The reduced representable range and increased rounding error of these formats can cause several problems during training, such as:

• Gradient underflow: Very small values can get zeroed out

• Activation/Loss overflow: Very large values can overflow to Infinity (Inf) or Not a Number (NaN) representations

The result can be reduced model accuracy or training divergence.
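Both failure modes are easy to reproduce directly in PyTorch. In this small illustration, values that are perfectly representable in fp32 underflow to zero or overflow to Inf once cast to fp16:

```python
import torch

# Gradient underflow: a small fp32 value flushes to zero in fp16
# (fp16's smallest positive subnormal is roughly 6e-8).
print(torch.tensor(1e-8).half())   # tensor(0., dtype=torch.float16)

# Activation/loss overflow: a large fp32 value exceeds fp16's maximum
# (about 65504) and becomes Inf.
print(torch.tensor(1e5).half())    # tensor(inf, dtype=torch.float16)
```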

The solution is to perform mixed precision training, where both single (fp32) and half-precision (fp16) formats are utilized strategically to avoid the issues above. Composer supports Automatic Mixed Precision (AMP) training using PyTorch’s torch.cuda.amp package. The Composer Trainer performs all the heavy lifting and tensor conversions automatically; the user simply has to set precision='amp_fp16' when initializing the Trainer.

Mixed precision training is usually performed as follows:

1. Compute the forward pass and loss in half-precision, except for computations that can cause activations/loss to overflow. These are performed in single-precision (e.g., BatchNorm).

2. Perform the backward pass in half-precision.

3. Store the weights and perform the optimizer step in single precision, enabling the weight update to be done more precisely.

4. Convert the model back to half-precision.

The procedure above mitigates the issues with overflow and imprecise weight updates.
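For reference, here is a minimal sketch of how these steps look when written directly against PyTorch's torch.cuda.amp utilities; Composer's Trainer performs the equivalent internally when precision='amp_fp16'. The model, optimizer, loss_fn, and dataloader names are placeholders. With autocast, the master weights simply stay in fp32 and inputs are cast per-op, so step 4 is handled implicitly.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss so small gradients do not underflow in fp16

for inputs, targets in dataloader:          # placeholder data loader
    optimizer.zero_grad()
    # Steps 1-2: forward pass, loss, and backward pass in half precision;
    # autocast keeps numerically sensitive ops in fp32.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()
    # Step 3: unscale the gradients and apply the update to the fp32 weights.
    scaler.step(optimizer)
    scaler.update()
```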

Another advantage of using Composer is support for optimizer closures when using mixed precision. A closure is responsible for clearing the gradients and re-computing the loss, as required by some optimizers. The Trainer supports closures when using amp if the optimizer object has the attribute _step_supports_amp_closure set to True. Please see the SAM Method Card for an example on how closures are used and the SAMOptimizer implementation for an example on how to use the _step_supports_amp_closure flag.
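As a rough, hypothetical sketch (not the actual SAMOptimizer implementation), an optimizer opts into closure support under AMP by exposing the flag and accepting a closure in its step method:

```python
import torch

class ClosureAwareSGD(torch.optim.SGD):
    """Hypothetical optimizer whose step() re-evaluates the loss via a closure."""

    # Signals to Composer's Trainer that step(closure=...) may be called under AMP.
    _step_supports_amp_closure = True

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                # The closure clears the gradients and recomputes the loss/backward pass.
                loss = closure()
        super().step()
        return loss
```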