# 🎰 Stochastic Depth (Sample)

Computer Vision

Sample-wise stochastic depth is a regularization technique for networks with residual connections. In each residual block, it probabilistically drops individual samples after the block's transformation function, so that each sample in a batch effectively passes through its own random subset of blocks.
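Concretely, within every residual block each sample in the batch is independently kept or dropped. The toy snippet below illustrates the per-sample decision; the shapes and the stand-in transformation are made up for illustration, and this is not Composer's code:

```python
import torch

batch = torch.randn(8, 64, 56, 56)   # (N, C, H, W)
drop_rate = 0.2

# One keep/drop decision per sample, broadcast over the feature dimensions
mask = torch.bernoulli(torch.full((8, 1, 1, 1), 1 - drop_rate))

transformed = batch * 2.0             # stand-in for the block's transformation f(x)
output = batch + mask * transformed   # dropped samples pass through the skip path only
```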

## How to Use

### Functional Interface

```python
# Run the Stochastic Depth algorithm directly on the model using the Composer functional API

import torch
import torch.nn.functional as F
import composer.functional as cf

from torchvision.models import resnet50

# Training

# Stochastic depth can only be run on ResNet-50/101/152
model = resnet50()
opt = torch.optim.Adam(model.parameters())

# `optimizers` only needs to be passed in if apply_stochastic_depth is used after
# the optimizer is created; otherwise only the model needs to be passed in
cf.apply_stochastic_depth(
    model,
    target_layer_name='ResNetBottleneck',
    stochastic_method='sample',
    drop_rate=0.2,
    drop_distribution='linear',
    optimizers=opt,
)

loss_fn = F.cross_entropy
model.train()

for epoch in range(10):
    # assumes `train_dataloader` is a standard PyTorch DataLoader yielding (X, y) batches
    for X, y in train_dataloader:
        y_hat = model(X)
        loss = loss_fn(y_hat, y)
        loss.backward()
        opt.step()
        opt.zero_grad()
```


### Composer Trainer

```python
# Instantiate the algorithm and pass it into the Trainer.
# The trainer will automatically run it at the appropriate point in the training loop.

from composer.algorithms import StochasticDepth
from composer.trainer import Trainer
from torchvision.models import resnet50

# Train model

# Stochastic depth can only be run on ResNet-50/101/152
model = resnet50()

stochastic_depth = StochasticDepth(
    target_layer_name='ResNetBottleneck',
    stochastic_method='sample',
    drop_rate=0.2,
    drop_distribution='linear',
)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,  # assumes a DataLoader is already defined
    max_duration='10ep',
    algorithms=[stochastic_depth],
)

trainer.fit()
```


## Implementation Details

The Composer implementation of stochastic depth uses model surgery to replace residual bottleneck blocks with analogous stochastic versions. During training, samples are dropped after the transformation function in a residual block by multiplying the batch by a binary vector. The binary vector is generated by drawing an independent Bernoulli sample per example with success probability (1 - `drop_rate`). After the samples are dropped, the skip connection is added as usual. During inference, no samples are dropped; instead, the transformation output is scaled by (1 - `drop_rate`) to compensate for the dropping that occurred during training. With `drop_distribution='linear'`, the drop rate increases linearly with block depth, from zero at the first block up to `drop_rate` at the final block.
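To make the mechanics concrete, the sketch below shows what such a stochastic block computes during training and inference. `StochasticBottleneckSketch` is a hypothetical stand-in for illustration, not Composer's actual replacement module (a real bottleneck block may also downsample the skip path):

```python
import torch
import torch.nn as nn

class StochasticBottleneckSketch(nn.Module):
    """Illustrative sample-wise stochastic wrapper around a residual block's
    transformation `f` (e.g. a bottleneck's conv stack); not Composer's code."""

    def __init__(self, f: nn.Module, drop_rate: float):
        super().__init__()
        self.f = f
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.f(x)
        if self.training:
            # One Bernoulli draw per sample with keep probability (1 - drop_rate),
            # broadcast across the channel and spatial dimensions.
            keep_prob = torch.full((x.shape[0],) + (1,) * (x.dim() - 1),
                                   1.0 - self.drop_rate, device=x.device)
            out = out * torch.bernoulli(keep_prob)
        else:
            # No dropping at inference; scale to match the expected training output.
            out = out * (1.0 - self.drop_rate)
        return out + x  # skip connection added as usual
```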

## Suggested Hyperparameters

We observe that `drop_rate=0.1` with `drop_distribution='linear'` yields the largest accuracy improvements on both ResNet-50 and ResNet-101.
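These values drop into either interface shown above; for example, with the `StochasticDepth` algorithm object:

```python
from composer.algorithms import StochasticDepth

# Suggested settings: drop_rate=0.1 with a linear drop distribution
stochastic_depth = StochasticDepth(
    target_layer_name='ResNetBottleneck',
    stochastic_method='sample',
    drop_rate=0.1,
    drop_distribution='linear',
)
```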

## Technical Details

For both ResNet-50 and ResNet-101 on ImageNet, we measure a +0.4% absolute accuracy improvement when using `drop_rate=0.1` and `drop_distribution='linear'`. Training wall-clock time is approximately 5% longer when using sample-wise stochastic depth.