🎰 Stochastic Depth (Sample)
Sample-wise stochastic depth is a regularization technique for networks with residual connections that probabilistically drops samples after the transformation function in each residual block. This means that different samples go through different combinations of blocks.
How to Use
```python
# Run the Stochastic Depth algorithm directly on the model using the Composer functional API

import torch
import torch.nn.functional as F
import composer.functional as cf
from torchvision.models import resnet50

# Training

# Stochastic depth can only be run on ResNet-50/101/152
model = resnet50()

opt = torch.optim.Adam(model.parameters())

# only need to pass in opt if apply_stochastic_depth is used after the optimizer
# creation; otherwise only the model needs to be passed in
cf.apply_stochastic_depth(
    model,
    target_layer_name='ResNetBottleneck',
    stochastic_method='sample',
    drop_rate=0.2,
    drop_distribution='linear',
    optimizers=opt
)

loss_fn = F.cross_entropy
model.train()

# train_loader is assumed to be an existing DataLoader yielding (inputs, labels)
for epoch in range(10):
    for X, y in train_loader:
        y_hat = model(X)
        loss = loss_fn(y_hat, y)
        loss.backward()
        opt.step()
        opt.zero_grad()
```
```python
# Instantiate the algorithm and pass it into the Trainer
# The trainer will automatically run it at the appropriate point in the training loop

from composer.algorithms import StochasticDepth
from composer.trainer import Trainer
from torchvision.models import resnet50

# Train model

# Stochastic depth can only be run on ResNet-50/101/152
# Note: the Trainer expects a ComposerModel, so the torchvision model may
# need to be wrapped before it is passed in
model = resnet50()

stochastic_depth = StochasticDepth(
    target_layer_name='ResNetBottleneck',
    stochastic_method='sample',
    drop_rate=0.2,
    drop_distribution='linear'
)

# train_dataloader is assumed to be defined elsewhere
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='10ep',
    algorithms=[stochastic_depth]
)

trainer.fit()
```
The Composer implementation of Stochastic Depth uses model surgery to replace residual bottleneck blocks with analogous stochastic versions. During training, samples are dropped after the transformation function in a residual block by multiplying the batch by a binary vector. The binary vector is generated by sampling independent Bernoulli distributions with probability (1 - `drop_rate`). After the samples are dropped, the skip connection is added as usual. During inference, no samples are dropped, but the batch of samples is scaled by (1 - `drop_rate`) to compensate for the drop frequency during training.
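The train and inference behavior described above can be illustrated with a minimal sketch. This is not Composer's implementation: the wrapper class `SampleStochasticResidual` and its `branch` argument are hypothetical names chosen for illustration, and the sketch assumes the branch preserves the input shape so the skip connection can be added directly.

```python
import torch
import torch.nn as nn


class SampleStochasticResidual(nn.Module):
    """Hypothetical wrapper applying sample-wise stochastic depth to a residual branch."""

    def __init__(self, branch: nn.Module, drop_rate: float = 0.2):
        super().__init__()
        if not 0.0 <= drop_rate < 1.0:
            raise ValueError("drop_rate must be in [0, 1)")
        self.branch = branch  # the block's transformation function
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.branch(x)
        keep_prob = 1.0 - self.drop_rate
        if self.training:
            # One independent Bernoulli draw per sample; the mask has shape
            # (batch, 1, 1, ...) so it broadcasts over the remaining dimensions.
            mask_shape = (x.shape[0],) + (1,) * (out.dim() - 1)
            probs = torch.full(mask_shape, keep_prob, dtype=out.dtype, device=out.device)
            out = out * torch.bernoulli(probs)
        else:
            # No samples are dropped at inference; scale to match the
            # expected value of the masked branch seen during training.
            out = out * keep_prob
        # The skip connection is added as usual.
        return x + out
```

With an identity branch and an all-ones input, each training sample comes out as either 1 (branch dropped) or 2 (branch kept), while in eval mode every entry equals 1 + (1 - `drop_rate`).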
We observe that `drop_distribution='linear'` yields the largest accuracy improvements on both ResNet-50 and ResNet-101.
For both ResNet-50 and ResNet-101 on ImageNet, we measure a +0.4% absolute accuracy improvement when using `drop_distribution='linear'`. The training wall-clock time is approximately 5% longer when using sample-wise stochastic depth.
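To make the linear setting concrete, the sketch below computes per-block drop rates under the linear decay rule from Huang et al., where block l of L receives a drop rate of (l / L) times the final rate. The helper name `linear_drop_rates` is hypothetical, and the exact block indexing Composer uses internally is an assumption here; the point is only that early blocks are dropped rarely and the last block is dropped at the full `drop_rate`.

```python
from typing import List


def linear_drop_rates(num_blocks: int, drop_rate: float) -> List[float]:
    """Per-block drop rates that rise linearly with depth up to `drop_rate`.

    Assumes 1-based block indexing, as in the linear decay rule of Huang et al.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    return [drop_rate * (i / num_blocks) for i in range(1, num_blocks + 1)]


# ResNet-50 has 3 + 4 + 6 + 3 = 16 bottleneck blocks.
rates = linear_drop_rates(num_blocks=16, drop_rate=0.2)
for index, rate in enumerate(rates, start=1):
    print(f"block {index:2d}: drop rate {rate:.4f}")
```

Under this rule the average drop rate across blocks is lower than with a uniform distribution at the same `drop_rate`, since only the final block reaches the full value.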
Deep Networks with Stochastic Depth by Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Published in ECCV in 2016.
EfficientNet model in the TPU GitHub repository from Google
EfficientNet model in the gen-efficientnet-pytorch GitHub repository by Ross Wightman