🤿 DeepLabv3+#

[Example] · [Architecture] · [Training Hyperparameters] · [Attribution] · [API Reference]

DeepLabv3+ is an architecture designed for semantic segmentation, i.e., per-pixel classification. It takes a feature map from a backbone architecture (e.g., ResNet-101) and outputs a classification for each pixel in the input image. Our implementation is a simple wrapper around torchvision’s ResNet for the backbone and mmsegmentation’s DeepLabv3+ for the head.

Example#

from composer.models import composer_deeplabv3

model = composer_deeplabv3(
    num_classes=150,                   # ADE20k has 150 classes
    backbone_arch="resnet101",
    backbone_weights="IMAGENET1K_V2",  # torchvision pre-trained weights
    sync_bn=False,                     # keep regular BatchNorm layers
)

Architecture#

Based on Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

  • Backbone network: converts the input image into a feature map.

    • Usually ResNet-101 with the strided convolutions in stages 3 and 4 converted to dilated convolutions.

    • The 3x3 convolutions in stages 3 and 4 have dilation sizes of 2 and 4, respectively, to compensate for the receptive field lost by removing the strides.

    • The average pooling and classification layers are not used.

  • Spatial Pyramid Pooling: extracts multi-resolution features from the stage 4 backbone feature map.

    • The backbone feature map is processed with four parallel convolution layers with dilations {1, 12, 24, 36} and kernel sizes {1x1, 3x3, 3x3, 3x3}.

    • In parallel to the convolutions, global average pool the backbone feature map, then bilinearly upsample to be the same spatial dimension as the feature map.

    • Concatenate the outputs from the convolutions and global average pool, then process with a 1x1 convolution.

    • The 3x3 convolutions are implemented as depthwise separable convolutions to reduce memory and computation cost.

  • Decoder: converts the output of spatial pyramid pooling (SPP) to class predictions of the same spatial dimension as the input image.

    • SPP output is bilinearly upsampled to be the same spatial dimension as the output from the first stage in the backbone network.

    • A 1x1 convolution is applied to the first stage activations, then this is concatenated with the upsampled SPP output.

    • The concatenation is processed by a 3x3 convolution with dropout followed by a classification layer.

    • The predictions are bilinearly upsampled to be the same resolution as the input image.
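The data flow through the spatial pyramid pooling and decoder steps above can be sketched in plain PyTorch. This is a minimal illustration, not the mmsegmentation head that the wrapper actually uses: the class name `ToyDeepLabV3PlusHead`, the helper `separable_conv`, the channel widths, and the dropout rate are all assumptions chosen to keep the example small, and the backbone is replaced by random feature maps at output strides 4 and 8.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def separable_conv(in_ch, out_ch, dilation):
    """3x3 depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation,
                  groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class ToyDeepLabV3PlusHead(nn.Module):
    """Hypothetical, simplified head; widths are illustrative only."""

    def __init__(self, low_ch, high_ch, num_classes, mid_ch=64,
                 low_proj_ch=16, dilations=(1, 12, 24, 36)):
        super().__init__()
        # Parallel branches: a 1x1 conv, then dilated separable 3x3 convs.
        branches = [nn.Sequential(nn.Conv2d(high_ch, mid_ch, 1, bias=False),
                                  nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))]
        branches += [separable_conv(high_ch, mid_ch, d) for d in dilations[1:]]
        self.branches = nn.ModuleList(branches)
        # Global average pooling branch.
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(high_ch, mid_ch, 1),
                                        nn.ReLU(inplace=True))
        # 1x1 conv applied to the concatenated branch outputs.
        self.fuse = nn.Conv2d(mid_ch * (len(dilations) + 1), mid_ch, 1)
        # Decoder: project first-stage features, refine, classify.
        self.low_proj = nn.Conv2d(low_ch, low_proj_ch, 1)
        self.refine = nn.Sequential(
            nn.Conv2d(mid_ch + low_proj_ch, mid_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout2d(0.1),
        )
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, low_feat, high_feat, image_size):
        size = high_feat.shape[-2:]
        pooled = F.interpolate(self.image_pool(high_feat), size=size,
                               mode="bilinear", align_corners=False)
        spp = self.fuse(torch.cat([b(high_feat) for b in self.branches] + [pooled], dim=1))
        # Upsample SPP output to the first-stage resolution and concatenate.
        spp = F.interpolate(spp, size=low_feat.shape[-2:],
                            mode="bilinear", align_corners=False)
        x = torch.cat([self.low_proj(low_feat), spp], dim=1)
        logits = self.classifier(self.refine(x))
        # Upsample predictions to the input image resolution.
        return F.interpolate(logits, size=image_size,
                             mode="bilinear", align_corners=False)


# Stand-ins for backbone features of a 64x64 image (strides 4 and 8).
head = ToyDeepLabV3PlusHead(low_ch=8, high_ch=32, num_classes=5).eval()
low = torch.randn(2, 8, 16, 16)
high = torch.randn(2, 32, 8, 8)
with torch.no_grad():
    out = head(low, high, image_size=(64, 64))
print(out.shape)
```

The output has one channel of logits per class at the input resolution, which is what makes the model a per-pixel classifier.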

Training Hyperparameters#

We tested two sets of hyperparameters for DeepLabv3+ trained on the ADE20k dataset.

Typical ADE20k Model Hyperparameters#

  • Model: deeplabv3:

    • Initializers: kaiming_normal, bn_ones

    • Number of classes: 150

    • Backbone weights: IMAGENET1K_V1

    • Sync BatchNorm

  • Optimizer: SGD

    • Learning rate: 0.01

    • Momentum: 0.9

    • Weight decay: 5.0e-4

    • Dampening: 0

    • Nesterov: false

  • LR schedulers:

    • Polynomial:

      • Alpha_f: 0.01

      • Power: 0.9

  • Number of epochs: 127

  • Batch size: 16

  • Precision: amp
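As a rough illustration of the polynomial schedule listed above, the sketch below computes the learning rate multiplier as a function of training progress. It assumes the common form `alpha_f + (1 - alpha_f) * (1 - progress) ** power`; the function name `polynomial_lr_multiplier` is hypothetical and the exact formula used by the scheduler should be checked against its own documentation.

```python
def polynomial_lr_multiplier(progress, power=0.9, alpha_f=0.01):
    """Multiplier applied to the base learning rate.

    progress: fraction of training completed, in [0, 1] (assumed convention).
    """
    progress = min(max(progress, 0.0), 1.0)
    return alpha_f + (1.0 - alpha_f) * (1.0 - progress) ** power


base_lr = 0.01  # learning rate from the list above
for progress in (0.0, 0.25, 0.5, 0.75, 1.0):
    lr = base_lr * polynomial_lr_multiplier(progress)
    print(f"progress={progress:.2f}  lr={lr:.6f}")
```

Under this assumed form, the learning rate starts at the base value and ends at `alpha_f` times the base value.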

| Model | mIoU | Time-to-Train on 8xA100 |
|---|---|---|
| ResNet101-DeepLabv3+ | 44.17 +/- 0.17 | 6.385 hr |

Composer ADE20k Model Hyperparameters#

  • Model: deeplabv3:

    • Initializers: kaiming_normal, bn_ones

    • Number of classes: 150

    • Backbone Architecture: resnet101

    • Sync BatchNorm

    • Backbone weights: IMAGENET1K_V2

  • Optimizer: Decoupled SGDW

    • Learning rate: 0.01

    • Momentum: 0.9

    • Weight decay: 2.0e-5

    • Dampening: 0

    • Nesterov: false

  • LR schedulers:

    • Cosine decay, t_max: 1dur

  • Number of epochs: 128

  • Batch size: 32

  • Precision: amp
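The cosine decay schedule with `t_max: 1dur` spans the full training duration. A minimal sketch of such a schedule follows, assuming the standard half-cosine form and a final multiplier of 0; the function name `cosine_lr_multiplier` and the `alpha_f` argument are illustrative assumptions rather than the scheduler's actual interface.

```python
import math


def cosine_lr_multiplier(progress, alpha_f=0.0):
    """Half-cosine decay from 1 to alpha_f over progress in [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)
    return alpha_f + (1.0 - alpha_f) * 0.5 * (1.0 + math.cos(math.pi * progress))


base_lr = 0.01  # learning rate from the list above
for progress in (0.0, 0.25, 0.5, 0.75, 1.0):
    lr = base_lr * cosine_lr_multiplier(progress)
    print(f"progress={progress:.2f}  lr={lr:.6f}")
```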

| Model | mIoU | Time-to-Train on 8xA100 |
|---|---|---|
| ResNet101-DeepLabv3+ | 45.764 +/- 0.29 | 4.67 hr |

Improvements:

  • New PyTorch pretrained weights

  • Cosine decay

  • Decoupled Weight Decay

  • Increase batch size to 32

  • Decrease weight decay to 2e-5
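To illustrate what decoupling the weight decay changes, the sketch below compares a single update step of SGD with L2-style (coupled) weight decay against a generic decoupled variant in the style of SGDW. Momentum is omitted for brevity, both function names are hypothetical, and the exact scaling the Decoupled SGDW optimizer applies to the decay term is an assumption here; it may differ in the real implementation.

```python
def coupled_sgd_step(w, grad, lr, weight_decay):
    """Weight decay folded into the gradient, so it is scaled by lr."""
    return w - lr * (grad + weight_decay * w)


def decoupled_sgd_step(w, grad, lr, weight_decay, schedule_multiplier=1.0):
    """Weight decay applied directly to the weight, independent of lr.

    schedule_multiplier: current LR-schedule multiplier (assumed convention).
    """
    return w - lr * grad - schedule_multiplier * weight_decay * w


w, grad, lr, wd = 1.0, 0.0, 0.01, 2.0e-5
# With a zero gradient, only the decay term moves the weight.
print("coupled shrink:  ", w - coupled_sgd_step(w, grad, lr, wd))
print("decoupled shrink:", w - decoupled_sgd_step(w, grad, lr, wd))
```

In this sketch the decoupled decay is not multiplied by the base learning rate, so the same `weight_decay` value has a different effective strength in the two forms. This is one reason weight decay values are not directly comparable between a coupled and a decoupled optimizer.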

Attribution#

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation by Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam

OpenMMLab Semantic Segmentation Toolbox and Benchmark

How to Train State-Of-The-Art Models Using TorchVision’s Latest Primitives by Vasilis Vryniotis

API Reference#

class composer.models.deeplabv3.composer_deeplabv3(num_classes, backbone_arch='resnet101', backbone_weights=None, sync_bn=True, use_plus=True, ignore_index=-1, cross_entropy_weight=1.0, dice_weight=0.0, initializers=())

Helper function to create a ComposerClassifier with a DeepLabv3(+) model. Logs Mean Intersection over Union (MIoU) and Cross Entropy during training and validation.

From Rethinking Atrous Convolution for Semantic Image Segmentation (Chen et al., 2017).

Parameters
  • num_classes (int) – Number of classes in the segmentation task.

  • backbone_arch (str, optional) – The architecture to use for the backbone. Must be one of ['resnet50', 'resnet101']. Default: 'resnet101'.

  • backbone_weights (str, optional) – If specified, the PyTorch pre-trained weights to load for the backbone. Currently, only ['IMAGENET1K_V1', 'IMAGENET1K_V2'] are supported. Default: None.

  • sync_bn (bool, optional) – If True, replace all BatchNorm layers with SyncBatchNorm layers. Default: True.

  • use_plus (bool, optional) – If True, use DeepLabv3+ head instead of DeepLabv3. Default: True.

  • ignore_index (int) – Class label to ignore when calculating the loss and other metrics. Default: -1.

  • cross_entropy_weight (float) – Weight to scale the cross entropy loss. Default: 1.0.

  • dice_weight (float) – Weight to scale the dice loss. Default: 0.0.

  • initializers (List[Initializer], optional) – Initializers for the model. Pass an empty sequence for no initialization. Default: ().

Returns

ComposerModel – instance of ComposerClassifier with a DeepLabv3(+) model.

Example:

from composer.models import composer_deeplabv3

model = composer_deeplabv3(num_classes=150, backbone_arch='resnet101', backbone_weights=None)
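For readers unfamiliar with the Mean Intersection over Union metric and the role of `ignore_index`, here is a small pure-Python sketch over flattened label lists. It is an illustration only: the function name `mean_iou` is hypothetical, predictions are assumed to lie in `[0, num_classes)`, and the metric logged by the model may aggregate over batches or handle absent classes differently.

```python
def mean_iou(pred, target, num_classes, ignore_index=-1):
    """Average per-class IoU over classes that appear in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # pixels labeled ignore_index do not count at all
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch adds to the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2]
target = [0, 1, 1, -1, 2]  # the fourth pixel is ignored
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```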