Dilated convolutions

ICLR 2016       Posted on May 5, 2020

Paper link

Take-away messages

  1. The dilated convolution operator can expand the receptive field without losing resolution or coverage, which makes it well suited to dense prediction tasks.
  2. A multi-scale context module reliably increases accuracy when plugged into existing semantic segmentation systems.
  3. Replacing pooling layers (designed for classification) with dilated convolutions (better suited to segmentation) can increase accuracy.

Model

  1. dilated convolutions
    • $l$-dilated convolution, written $∗_l$:

      \((F ∗_l k)(p) = \sum_{s+lt=p} F(s)\,k(t)\)

      The familiar discrete convolution is simply the 1-dilated convolution ($l = 1$).
    • Let $F_0, F_1, \cdots, F_{n−1} : Z^2 \rightarrow R$ be discrete functions and let $k_0, k_1, \cdots, k_{n−2} : Ω_1 \rightarrow R$ (where $Ω_r = [−r, r]^2 ∩ Z^2$) be discrete 3×3 filters. Applying the filters with exponentially increasing dilation, \(F_{i+1} = F_i ∗_{2^i} k_i \text{ for } i = 0, 1, \cdots, n − 2\), makes the receptive field of each element in $F_{i+1}$ a square of size $(2^{i+2} − 1)×(2^{i+2} − 1)$: the receptive field grows exponentially while the number of parameters per layer stays constant (checked numerically in the sketch below).
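
    A minimal PyTorch sketch (not the paper's code; `F.conv2d` with its `dilation` argument stands in for $∗_l$) that verifies the receptive-field claim by backpropagating from the centre output element and measuring the support of the gradient:

    ```python
    import torch
    import torch.nn.functional as F

    # Stack 3x3 convolutions with dilations 2^i and measure the receptive
    # field of the centre output element via the support of its gradient.
    x = torch.zeros(1, 1, 129, 129, requires_grad=True)
    weight = torch.ones(1, 1, 3, 3)  # a single all-ones 3x3 filter

    out = x
    for i in range(4):  # compute F_1 .. F_4
        out = F.conv2d(out, weight, padding=2**i, dilation=2**i)
        grad, = torch.autograd.grad(out[0, 0, 64, 64], x, retain_graph=True)
        rows = grad[0, 0].abs().sum(dim=1).nonzero()
        side = rows.max().item() - rows.min().item() + 1
        print(f"F_{i+1}: {side}x{side}")  # 3x3, 7x7, 15x15, 31x31 = 2^(i+2)-1
    ```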
  2. context module

    The context module is designed to increase the performance of dense prediction architectures by aggregating multi-scale contextual information. The module takes C feature maps as input and produces C feature maps as output. The input and output have the same form, thus the module can be plugged into existing dense prediction architectures.

    • module architecture

      here, truncation is the ReLU function $f(\cdot)=\max(\cdot, 0)$. The basic module consists of seven 3×3 dilated layers (dilations 1, 1, 2, 4, 8, 16, 1), each followed by truncation, plus a final 1×1 layer (architecture table omitted; see Table 1 of the paper). A code sketch follows the initialization bullets below.

    • initialization

      Experiments revealed that standard initialization procedures (random initialization) do not readily support the training of the module.

      • Basic: identity initialization, $k^b(t, a) = 1_{[t=0]}\,1_{[a=b]}$, where $a$ indexes the input feature map and $b$ the output feature map, so each layer initially passes its input straight through.
      • Large: the identity initialization generalized to layers whose input and output have different numbers of feature maps, with a small amount of random noise added to break symmetry.
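
    A minimal PyTorch sketch of the basic context module with identity initialization, assuming the layer schedule recalled above; zero padding inside the module, the omission of truncation after the final 1×1 layer, and C = 21 (the VOC class count) are simplifying assumptions here:

    ```python
    import torch
    import torch.nn as nn

    class BasicContextModule(nn.Module):
        """Basic context module: seven 3x3 conv layers with dilations
        1, 1, 2, 4, 8, 16, 1, each followed by truncation (ReLU), then a
        final 1x1 layer. C feature maps in, C feature maps out."""
        def __init__(self, C):
            super().__init__()
            layers = []
            for d in (1, 1, 2, 4, 8, 16, 1):
                layers += [nn.Conv2d(C, C, 3, padding=d, dilation=d), nn.ReLU()]
            layers.append(nn.Conv2d(C, C, 1))  # final 1x1 layer, no ReLU (assumption)
            self.net = nn.Sequential(*layers)
            self._identity_init()

        def _identity_init(self):
            # Basic identity initialization k^b(t, a) = 1_[t=0] 1_[a=b]:
            # the centre tap of each filter's own channel is 1, all else 0,
            # so the freshly plugged-in module initially computes the identity.
            for m in self.net:
                if isinstance(m, nn.Conv2d):
                    nn.init.zeros_(m.weight)
                    nn.init.zeros_(m.bias)
                    c = m.kernel_size[0] // 2
                    for ch in range(m.out_channels):
                        m.weight.data[ch, ch, c, c] = 1.0

        def forward(self, x):
            return self.net(x)

    # At initialization the module passes non-negative features (e.g. the
    # output of a ReLU front end) through unchanged:
    x = torch.relu(torch.randn(1, 21, 64, 64))
    assert torch.allclose(BasicContextModule(21)(x), x)
    ```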
  3. front end
    • adapted the VGG-16 network for dense prediction by removing the last two pooling and striding layers; convolutions in all subsequent layers are dilated by a factor of 2 for each removed pooling layer, so the final layers end up dilated by a factor of 4 (see the sketch after this list).
    • use reflection padding: the buffer zone is filled by reflecting the image about each edge.
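
    A hypothetical sketch of the pooling-removal recipe applied to a network tail (the helper name and the per-pool dilation bookkeeping are my own framing, not the paper's code). Reflection padding as in the second bullet is available in PyTorch as `nn.ReflectionPad2d` or `F.pad(..., mode='reflect')`.

    ```python
    import torch.nn as nn

    def remove_pool_and_dilate(layers):
        """Drop stride-2 pooling layers from a flat list of modules (e.g. the
        tail of VGG-16, starting at the first pool to remove) and dilate every
        subsequent convolution by a factor of 2 per removed pool, so the
        filters keep their field of view while the output stays full-resolution."""
        out, dilation = [], 1
        for m in layers:
            if isinstance(m, nn.MaxPool2d):
                dilation *= 2  # drop the pool, remember the lost stride
            elif isinstance(m, nn.Conv2d):
                k = m.kernel_size[0]
                conv = nn.Conv2d(m.in_channels, m.out_channels, k,
                                 dilation=dilation, padding=dilation * (k // 2))
                conv.weight.data.copy_(m.weight.data)  # reuse pretrained weights
                conv.bias.data.copy_(m.bias.data)
                out.append(conv)
            else:
                out.append(m)  # ReLU etc. pass through unchanged
        return nn.Sequential(*out)
    ```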

Results

  1. Front end: (results tables omitted) the simplified front end alone increases accuracy, consistent with take-away 3.

  2. Context module: (results tables omitted) plugging the context module into the front end reliably increases accuracy further, consistent with take-away 2.