🤿 DeepLabv3+#

[Example] · [Architecture] · [Training Hyperparameters] · [Attribution] · [API Reference]

DeepLabv3+ is an architecture designed for semantic segmentation, i.e., per-pixel classification. DeepLabv3+ takes feature maps from a backbone architecture (e.g., ResNet-101) and outputs a classification for each pixel in the input image. Our implementation is a simple wrapper around torchvision’s ResNet for the backbone and mmsegmentation’s DeepLabv3+ for the head.

Example#

from composer.models import composer_deeplabv3

model = composer_deeplabv3(num_classes=150,
                           backbone_arch="resnet101",
                           backbone_weights="IMAGENET1K_V2",
                           sync_bn=False
)

Architecture#

Based on Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Backbone network: converts the input image into a feature map.
- Usually ResNet-101 with the strided convolutions in stages 3 and 4 converted to dilated convolutions.
- The 3x3 convolutions in stages 3 and 4 have dilation sizes of 2 and 4, respectively, to compensate for the decreased receptive field.
- The backbone's final average pooling and classification layers are not used.
Spatial Pyramid Pooling: extracts multi-resolution features from the stage 4 backbone feature map.
- The backbone feature map is processed with four parallel convolution layers with dilations {1, 12, 24, 36} and kernel sizes {1x1, 3x3, 3x3, 3x3}.
- In parallel to the convolutions, global average pool the backbone feature map, then bilinearly upsample to be the same spatial dimension as the feature map.
- Concatenate the outputs from the convolutions and global average pool, then process with a 1x1 convolution.
- The 3x3 convolutions are implemented as depthwise separable convolutions to reduce memory and computation cost.
Decoder: converts the output of spatial pyramid pooling (SPP) to class predictions of the same spatial dimension as the input image.
- SPP output is bilinearly upsampled to be the same spatial dimension as the output from the first stage in the backbone network.
- A 1x1 convolution is applied to the first stage activations, then this is concatenated with the upsampled SPP output.
- The concatenation is processed by a 3x3 convolution with dropout followed by a classification layer.
- The predictions are bilinearly upsampled to be the same resolution as the input image.
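
To make the resolutions in the three stages above concrete, here is a minimal sketch in plain Python that tracks the spatial size of the feature map through the network and computes how many positions each dilated kernel spans. It assumes a torchvision-style ResNet whose stem reduces resolution by 4 and whose stage 2 halves it again, and a square input whose side is divisible by 8. The function names are hypothetical and are not part of Composer.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of feature-map positions spanned by one side of a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def feature_map_sizes(input_size: int) -> dict:
    """Side length of the square feature map at key points in the network.

    Assumption: the stem (stride-2 conv + stride-2 max pool) reduces the
    resolution by 4, stage 1 adds no stride, stage 2 halves the resolution,
    and stages 3 and 4 keep it because their strides were replaced by dilation.
    """
    stage1 = input_size // 4
    stage2 = stage1 // 2
    stage3 = stage2  # dilation 2 instead of stride 2
    stage4 = stage3  # dilation 4 instead of stride 2
    return {
        "stage1": stage1,
        "stage4": stage4,
        "spp_output": stage4,      # pooled branch is upsampled back to this size
        "decoder": stage1,         # SPP output is upsampled to match stage 1
        "prediction": input_size,  # final bilinear upsampling
    }
```

For a 512x512 input, `feature_map_sizes(512)` gives a 64x64 stage 4 feature map and a 128x128 decoder feature map. The 3x3 branches with dilations 12, 24, and 36 span 25, 49, and 73 positions per side, respectively.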

Training Hyperparameters#

We tested two sets of hyperparameters for DeepLabv3+ trained on the ADE20k dataset.

Typical ADE20k Model Hyperparameters#

Model: deeplabv3:
- Initializers: kaiming_normal, bn_ones
- Number of classes: 150
- Backbone weights: IMAGENET1K_V1
- Sync BatchNorm
Optimizer: SGD
- Learning rate: 0.01
- Momentum: 0.9
- Weight decay: 5.0e-4
- Dampening: 0
- Nesterov: false
LR schedulers:
- Polynomial:
  - Alpha_f: 0.01
  - Power: 0.9
Number of epochs: 127
Batch size: 16
Precision: amp

Model	mIoU	Time-to-Train on 8xA100
ResNet101-DeepLabv3+	44.17 +/- 0.17	6.385 hr
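
As a rough illustration of the polynomial schedule listed above, the sketch below computes the learning rate at a given point in training. It assumes the multiplier decays from 1 to `alpha_f` as `alpha_f + (1 - alpha_f) * (1 - t / t_max) ** power`; the function name and the step-based interface are hypothetical, and the scheduler used in training may differ in details.

```python
def polynomial_lr(step: int, total_steps: int, base_lr: float = 0.01,
                  power: float = 0.9, alpha_f: float = 0.01) -> float:
    """Learning rate under polynomial decay from base_lr to alpha_f * base_lr."""
    # Fraction of training completed, clamped so the rate stays at the floor afterwards.
    frac = min(step / total_steps, 1.0)
    # Assumed form of the multiplier: 1 at the start, alpha_f at the end.
    factor = alpha_f + (1.0 - alpha_f) * (1.0 - frac) ** power
    return base_lr * factor
```

With the values above, the rate starts at 0.01 and ends at 0.01 * 0.01 = 1e-4.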

Composer ADE20k Model Hyperparameters#

Model: deeplabv3:
- Initializers: kaiming_normal, bn_ones
- Number of classes: 150
- Backbone Architecture: resnet101
- Sync BatchNorm
- Backbone weights: IMAGENET1K_V2
Optimizer: Decoupled SGDW
- Learning rate: 0.01
- Momentum: 0.9
- Weight decay: 2.0e-5
- Dampening: 0
- Nesterov: false
LR schedulers:
- Cosine decay, t_max: 1dur
Number of epochs: 128
Batch size: 32
Precision: amp

Model	mIoU	Time-to-Train on 8xA100
ResNet101-DeepLabv3+	45.764 +/- 0.29	4.67 hr
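
The cosine decay schedule above can be sketched the same way. This assumes `t_max: 1dur` means the decay spans the full training duration and that the final multiplier is 0, which the hyperparameter list does not state. The function name is hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 0.01,
              alpha_f: float = 0.0) -> float:
    """Learning rate under cosine decay from base_lr to alpha_f * base_lr."""
    # Fraction of training completed, clamped at the end of training.
    frac = min(step / total_steps, 1.0)
    # Half-cosine multiplier: 1 at the start, alpha_f at the end (assumed 0 here).
    factor = alpha_f + (1.0 - alpha_f) * 0.5 * (1.0 + math.cos(math.pi * frac))
    return base_lr * factor
```

Under these assumptions the rate is 0.01 at the start, 0.005 halfway through, and approximately 0 at the end.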

Improvements:

- New PyTorch pretrained weights
- Cosine decay
- Decoupled weight decay
- Increase batch size to 32
- Decrease weight decay to 2e-5
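
To show what decoupling the weight decay changes, here is a minimal scalar sketch that contrasts a classic SGD step with a decoupled one. It is not Composer's optimizer code: the function names are hypothetical, and scaling the decay by `lr / initial_lr` (so that it follows the schedule but not the base learning rate) is an assumption.

```python
def coupled_sgd_step(w, grad, buf, lr, momentum=0.9, weight_decay=5.0e-4):
    """Classic SGD: decay is added to the gradient, so it enters the momentum buffer."""
    g = grad + weight_decay * w
    buf = momentum * buf + g
    w = w - lr * buf
    return w, buf


def decoupled_sgd_step(w, grad, buf, lr, initial_lr, momentum=0.9, weight_decay=2.0e-5):
    """Decoupled SGD: decay shrinks the weight directly and never touches the buffer."""
    buf = momentum * buf + grad
    # Assumption: decay follows the schedule multiplier, not the absolute learning rate.
    decay_scale = lr / initial_lr
    w = w * (1.0 - decay_scale * weight_decay) - lr * buf
    return w, buf
```

With a zero gradient, the decoupled step shrinks a weight of 1.0 by exactly the factor `1 - 2e-5` and leaves the momentum buffer at 0, whereas the coupled step stores the decay term in the buffer, where it keeps influencing later updates.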

Attribution#

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation by Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam

OpenMMLab Semantic Segmentation Toolbox and Benchmark

How to Train State-Of-The-Art Models Using TorchVision’s Latest Primitives by Vasilis Vryniotis

API Reference#

class composer.models.deeplabv3.composer_deeplabv3(num_classes, backbone_arch='resnet101', backbone_weights=None, sync_bn=True, use_plus=True, ignore_index=-1, cross_entropy_weight=1.0, dice_weight=0.0, initializers=())

Helper function to create a ComposerClassifier with a DeepLabv3(+) model. Logs: Mean Intersection over Union (MIoU) and Cross Entropy during training and validation.
From Rethinking Atrous Convolution for Semantic Image Segmentation (Chen et al., 2017).

Parameters

num_classes (int) – Number of classes in the segmentation task.
backbone_arch (str, optional) – The architecture to use for the backbone. Must be one of ['resnet50', 'resnet101']. Default: 'resnet101'.
backbone_weights (str, optional) – If specified, the PyTorch pre-trained weights to load for the backbone. Currently, only ['IMAGENET1K_V1', 'IMAGENET1K_V2'] are supported. Default: None.
sync_bn (bool, optional) – If True, replace all BatchNorm layers with SyncBatchNorm layers. Default: True.
use_plus (bool, optional) – If True, use DeepLabv3+ head instead of DeepLabv3. Default: True.
ignore_index (int, optional) – Class label to ignore when calculating the loss and other metrics. Default: -1.
cross_entropy_weight (float, optional) – Weight to scale the cross entropy loss. Default: 1.0.
dice_weight (float, optional) – Weight to scale the dice loss. Default: 0.0.
initializers (List[Initializer], optional) – Initializers for the model. Pass an empty sequence for no initialization. Default: ().

Returns

ComposerModel – instance of ComposerClassifier with a DeepLabv3(+) model.

Example:

from composer.models import composer_deeplabv3

model = composer_deeplabv3(num_classes=150, backbone_arch='resnet101', backbone_weights=None)
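
To illustrate how `ignore_index` interacts with the logged MIoU metric, here is a minimal plain-Python sketch of mean intersection over union on flattened label sequences. It is not the metric implementation used by the model: the function name is hypothetical, predictions are assumed to lie in `0..num_classes-1`, and skipping classes that appear in neither predictions nor targets is an assumption.

```python
def mean_iou(preds, targets, num_classes, ignore_index=-1):
    """Mean intersection over union across classes, skipping ignored pixels."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(preds, targets):
        if t == ignore_index:
            continue  # ignored pixels contribute to no class
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both the predicted and the true class.
            union[p] += 1
            union[t] += 1
    # Assumption: classes absent from both predictions and targets are left out of the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, -1], num_classes=2)` ignores the last pixel and gives an IoU of 1/2 for each class, so the mean is 0.5.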