🎰 Stochastic Depth (Sample)#

[How to Use] - [Suggested Hyperparameters] - [Technical Details] - [Attribution] - [API Reference]

Computer Vision

Sample-wise stochastic depth is a regularization technique for networks with residual connections that probabilistically drops samples after the transformation function in each residual block. This means that different samples go through different combinations of blocks.

How to Use#

Functional Interface#

# Run the Stochastic Depth algorithm directly on the model using the Composer functional API

import torch
import torch.nn.functional as F

import composer.functional as cf
from composer.models import composer_resnet

# Training

# Stochastic depth can only be run on ResNet-50/101/152
model = composer_resnet('resnet50')

opt = torch.optim.Adam(model.parameters())

cf.apply_stochastic_depth(
    model,
    target_layer_name='ResNetBottleneck',
    stochastic_method='sample',
    drop_rate=0.2,
    drop_distribution='linear'
)

loss_fn = F.cross_entropy
model.train()

for epoch in range(1):
    for X, y in train_dataloader:
        y_hat = model([X, y])
        loss = loss_fn(y_hat, y)
        loss.backward()
        opt.step()
        opt.zero_grad()
        break

Composer Trainer#

# Instantiate the algorithm and pass it into the Trainer
# The trainer will automatically run it at the appropriate point in the training loop

from composer.algorithms import StochasticDepth
from composer.models import composer_resnet
from composer.trainer import Trainer

# Train model

# Stochastic depth can only be run on ResNet-50/101/152
model = composer_resnet('resnet50')

stochastic_depth = StochasticDepth(
    target_layer_name='ResNetBottleneck',
    stochastic_method='sample',
    drop_rate=0.2,
    drop_distribution='linear'
)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration='1ep',
    algorithms=[stochastic_depth]
)

trainer.fit()

Implementation Details#

The Composer implementation of Stochastic Depth uses model surgery to replace residual bottleneck blocks with analogous stochastic versions. During training, samples are dropped after the transformation function in a residual block by multiplying the batch of transformed samples by a binary vector, where each element is drawn from an independent Bernoulli distribution with keep probability (1 - drop_rate). After samples are dropped, the skip connection is added as usual. During inference, no samples are dropped; instead, the transformation output is scaled by (1 - drop_rate) to compensate for the drop frequency during training.
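
To make this behavior concrete, below is a minimal sketch of a sample-wise stochastic residual block. The SampleStochasticBlock class and its transform argument are illustrative names only, not Composer's actual modules (those are swapped in automatically by model surgery), and the sketch uses a single fixed drop rate, whereas drop_distribution='linear' varies the per-block rate with depth.

import torch
import torch.nn as nn

class SampleStochasticBlock(nn.Module):
    """Illustrative sample-wise stochastic residual block (not Composer's class)."""

    def __init__(self, transform: nn.Module, drop_rate: float):
        super().__init__()
        self.transform = transform  # the block's transformation function F(x)
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.transform(x)
        if self.training:
            # One Bernoulli(1 - drop_rate) draw per sample in the batch;
            # dropped samples have their transformation output zeroed.
            keep = torch.bernoulli(
                torch.full((x.shape[0],), 1.0 - self.drop_rate, device=x.device)
            )
            out = out * keep.view(-1, *([1] * (out.dim() - 1)))
        else:
            # At inference, keep every sample but scale the transformation output
            # by the keep probability to match its expected magnitude during training.
            out = out * (1.0 - self.drop_rate)
        # The skip connection is added as usual.
        return x + out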

Suggested Hyperparameters#

We observe that drop_rate=0.1 with drop_distribution='linear' yields the largest accuracy improvements on both ResNet-50 and ResNet-101.
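
For example, these suggested values can be passed directly to the algorithm object used with the Trainer above:

from composer.algorithms import StochasticDepth

# Suggested settings: sample-wise stochastic depth with a 10% drop rate
# that is distributed linearly across block depth.
stochastic_depth = StochasticDepth(
    target_layer_name='ResNetBottleneck',
    stochastic_method='sample',
    drop_rate=0.1,
    drop_distribution='linear'
)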

Technical Details#

For both ResNet-50 and ResNet-101 on ImageNet, we measure a +0.4% absolute accuracy improvement when using drop_rate=0.1 and drop_distribution=linear. The training wall-clock time is approximately 5% longer when using sample-wise stochastic depth.

Attribution#

Deep Networks with Stochastic Depth by Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Published in ECCV 2016.

EfficientNet model in the TPU GitHub repository from Google

EfficientNet model in the gen-efficientnet-pytorch GitHub repository by Ross Wightman

API Reference#

Algorithm class: composer.algorithms.StochasticDepth

Functional: composer.functional.apply_stochastic_depth()