❄️ Layer Freezing#

[How to Use] - [Suggested Hyperparameters] - [Technical Details] - [Attribution] - [API Reference]

Computer Vision

Layer Freezing gradually makes early modules untrainable (“freezing” them), saving the cost of backpropagating to and updating frozen modules. The hypothesis behind Layer Freezing is that early layers may learn their features sooner than later layers, meaning they do not need to be updated later in training.

How to Use#

Functional Interface#

# Run the layer freezing algorithm using the Composer functional API
import torch
import torch.nn.functional as F

from composer import functional as cf

def training_loop(model, train_loader, num_epochs):
    opt = torch.optim.Adam(model.parameters())
    loss_fn = F.cross_entropy
    model.train()

    for epoch in range(num_epochs):
        for X, y in train_loader:
            y_hat = model(X)
            loss = loss_fn(y_hat, y)
            loss.backward()
            opt.step()
            opt.zero_grad()

        # Apply layer freezing at the end of each epoch
        freeze_depth, freeze_percentage = cf.freeze_layers(
            model=model,
            optimizers=opt,
            current_duration=epoch / num_epochs,
            freeze_start=0.0,
            freeze_level=1.0,
        )

Composer Trainer#

# Instantiate the algorithm and pass it into the Trainer
# The trainer will automatically run it at the appropriate points in the training loop

from composer.algorithms import LayerFreezing
from composer.trainer import Trainer

layer_freezing_algorithm = LayerFreezing(freeze_start=0.0, freeze_level=1.0)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration='1ep',
    algorithms=[layer_freezing_algorithm]
)

trainer.fit()

Implementation Details#

At the end of each epoch after freeze_start, the algorithm traverses the ownership tree of torch.nn.Module objects within the model in depth-first order to obtain a list of all modules. Note that this ordering may differ from the order in which modules are actually used in the forward pass. Given this list of modules, the algorithm computes how many modules to freeze. This number increases linearly over time such that no modules are frozen at freeze_start and a fraction equal to freeze_level is frozen at the end of training. Modules are frozen by removing their parameters from the optimizer’s param_groups; their associated state dict entries are not removed.
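
To make the traversal and freezing mechanics concrete, here is a minimal sketch. It is not the exact Composer implementation; freeze_first_n_modules and its arguments are illustrative names only.

# Illustrative sketch of the freezing mechanics described above; not the
# exact Composer internals. freeze_first_n_modules is a hypothetical helper.
import torch

def freeze_first_n_modules(model: torch.nn.Module, optimizer: torch.optim.Optimizer, n: int):
    # model.modules() yields submodules of the ownership tree in depth-first
    # (pre-order) order, which may differ from forward-pass order.
    modules = list(model.modules())

    frozen_params = set()
    for module in modules[:n]:
        # recurse=False touches only parameters owned directly by this module;
        # child modules are reached separately by the depth-first traversal.
        for p in module.parameters(recurse=False):
            p.requires_grad = False
            frozen_params.add(p)

    # Remove the frozen parameters from the optimizer's param_groups so they are
    # no longer updated. Their entries in the model's state dict remain intact.
    for group in optimizer.param_groups:
        group['params'] = [p for p in group['params'] if p not in frozen_params]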

Suggested Hyperparameters#

Layer Freezing works best when the entire network is trainable before freezing begins. We have found that freeze_start should be at least 0.1. The setting of freeze_level is context specific.
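
As a concrete starting point, the suggestion above might translate into the following configuration; freeze_level=0.5 is purely an illustrative value, not a recommendation.

from composer.algorithms import LayerFreezing

# freeze_start=0.1 keeps the whole network trainable for the first 10% of training,
# per the suggestion above. freeze_level=0.5 is an illustrative value; tune it for your task.
layer_freezing_algorithm = LayerFreezing(freeze_start=0.1, freeze_level=0.5)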

Technical Details#

Layer Freezing begins freezing the earliest layers in the network at the point in training specified by freeze_start (e.g., 10% of the way through training if freeze_start=0.1). Over the remainder of training, it progressively freezes later layers, increasing the number of frozen layers linearly over time until the latest layer to be frozen (specified by freeze_level) is frozen just before the end of training.
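
The linear schedule can be summarized with a small sketch; fraction_to_freeze is a hypothetical helper written for illustration, not part of the Composer API.

# Hypothetical helper illustrating the linear freezing schedule described above.
def fraction_to_freeze(current_duration: float, freeze_start: float, freeze_level: float) -> float:
    """Fraction of the module list to freeze at the given point in training."""
    if current_duration <= freeze_start:
        return 0.0  # nothing is frozen before freeze_start
    # Interpolate linearly from 0 at freeze_start to freeze_level at the end of training.
    progress = (current_duration - freeze_start) / (1.0 - freeze_start)
    return freeze_level * min(progress, 1.0)

# Example: with freeze_start=0.1 and freeze_level=1.0, halfway through training
# roughly 44% of the module list is frozen.
print(fraction_to_freeze(0.5, freeze_start=0.1, freeze_level=1.0))  # ~0.44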

We have yet to observe a significant improvement in the tradeoff between speed and accuracy from using Layer Freezing on our computer vision benchmarks. We have observed that layer freezing can increase throughput by ~5% for ResNet-50 on ImageNet, but it decreases accuracy by 0.5-1%, which is not an especially good speed vs. accuracy tradeoff. Existing papers have generally also not found effective tradeoffs on large-scale problems. For ResNet-56 on CIFAR-100, we have observed an accuracy lift from 75.82% to 76.22% with a similar ~5% speed increase. However, these results used specific hyperparameters without replicates and should be interpreted with caution.

❗ There is No Evidence that Layer Freezing Improves the Tradeoff Between Model Quality and Training Speed

Although layer freezing does improve throughput, it can also lead to accuracy reductions (as we observed for ResNet-50 on ImageNet). This tradeoff between improved throughput and reduced quality was not worthwhile in our experiments: it did not improve the Pareto frontier of the tradeoff between quality and training speed.

🚧 Composing Regularization Methods

Layer freezing is a relaxed version of early stopping that stops training the model gradually rather than all at once. It can therefore be understood as a form of regularization. As a general rule, composing regularization methods may lead to diminishing returns in quality improvements.

Attribution#

Freezing layers is an old and common practice, but our precise freezing scheme most closely resembles FreezeOut: Accelerate Training by Progressively Freezing Layers by Andrew Brock, Theodore Lim, J.M. Ritchie, and Nick Weston (posted on arXiv in 2017) and SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability by Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein (published in NeurIPS 2017).

The Composer implementation of this method and the accompanying documentation were produced by Cory Stephenson at MosaicML.

API Reference#

Algorithm class: composer.algorithms.LayerFreezing

Functional: composer.functional.freeze_layers()