Converting a Fully Connected Layer to a Convolutional Layer
For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except at certain blocks (due to local connectivity), where the weights in many of the blocks are equal (due to parameter sharing). Source: cs231n.
If we take a look at the VGG16 architecture, we can see that the output
dimension of its last convolutional block (after the final pooling layer) is $7 \times 7 \times 512$.
The next layer is an FC layer with output dimension $1\times1\times4096$.
To get the same output dimension using convolutions, we just need to create $4096$ filters of size $7\times7\times512$, i.e. kernels that span the full spatial extent and depth of the input volume.
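A minimal sketch of this conversion in PyTorch (the notes name no framework, so that is an assumption here), showing that the FC layer and its convolutional counterpart compute the same values once the trained weights are reshaped into filter form; all variable names are illustrative:

```python
import torch
import torch.nn as nn

# Original FC layer: flattens the 7x7x512 feature map into a 25088-vector.
fc = nn.Linear(512 * 7 * 7, 4096)

# Equivalent convolution: 4096 filters of size 7x7x512.
conv = nn.Conv2d(in_channels=512, out_channels=4096, kernel_size=7)

# Reuse the trained FC weights by reshaping them into filter form.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

# Both layers produce the same numbers on a 7x7x512 input (up to float error).
x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))      # shape (1, 4096)
out_conv = conv(x).flatten(1)  # shape (1, 4096), from a (1, 4096, 1, 1) map
assert torch.allclose(out_fc, out_conv, atol=1e-4)
```

The payoff of the convolutional form is that, unlike the FC layer, it also accepts inputs larger than $7\times7$ and then produces a spatial grid of outputs instead of a single vector.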
FCN for Semantic Segmentation
We start from a neural network trained on a classification task. The main idea is to substitute the last fully connected layer with a convolution layer. If we then scale this last layer up to the input image size, we get a segmentation mask. These direct predictions of an FCN are typically of low resolution, because the output feature maps are heavily downsampled inside the network.
Upsampling is done by transposed convolution (also called deconvolution).
This is needed because, while going deeper yields richer semantic features, spatial location information is progressively lost along the way.
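A small sketch of how a transposed convolution changes spatial size, again assuming PyTorch and an illustrative 21-class score map. With $kernel = 2 \times stride$ and $padding = stride / 2$ (the pattern of the FCN layers listed below; the $kernel=64$ value for stride 32 follows that pattern and is an assumption here), the output is exactly $stride$ times larger:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 21, 7, 7)  # e.g. a 7x7 score map with 21 class channels

# kernel = 2*stride with padding = stride/2 exactly doubles the spatial size.
up2 = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1)
print(up2(x).shape)   # torch.Size([1, 21, 14, 14])

# The same pattern with stride 32 jumps straight back to a 224x224 input size.
up32 = nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32, padding=16)
print(up32(x).shape)  # torch.Size([1, 21, 224, 224])
```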
FCN-32, FCN-16, FCN-8
(Some notes on the paper *Fully Convolutional Networks for Semantic Segmentation*)
Main features:
- VGG16 is used as the base network
- Upsampling via transposed convolution (deconvolution)
- Fusing outputs from intermediate layers (skip connections)
FCN-32:
- The last fully connected layer (layer_7) is replaced by a $1\times1$ convolution layer conv_7 (see the conversion above)
- The segmentation map is produced directly from conv_7, using a single transposed convolution layer with stride 32 (sketched below)
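A minimal sketch of the FCN-32 head under the assumptions above (PyTorch, 21 classes, a $224\times224$ input so conv_7 is $7\times7$; the $kernel=64$ choice again follows the $kernel = 2 \times stride$ pattern; a $1\times1$ scoring convolution maps conv_7 into class space, as in the paper):

```python
import torch
import torch.nn as nn

num_classes = 21  # illustrative, e.g. PASCAL VOC

# A 1x1 convolution scores conv_7 into one map per class at 1/32 resolution.
score7 = nn.Conv2d(4096, num_classes, kernel_size=1)

# A single stride-32 transposed convolution restores the input resolution.
up32 = nn.ConvTranspose2d(num_classes, num_classes,
                          kernel_size=64, stride=32, padding=16)

conv7 = torch.randn(1, 4096, 7, 7)  # stand-in for the conv_7 output
mask = up32(score7(conv7))          # (1, 21, 224, 224) segmentation scores
```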
FCN-16:
- The last fully connected layer (layer_7) is replaced by a $1\times1$ convolution layer conv_7 (see the conversion above)
- The new layer (conv_7) is upsampled 2 times to match the dimensions of pool_4, using a transposed convolution with $(kernel=4, stride=2, padding=same)$. A skip connection between pool_4 and the upsampled conv_7 is added; the sum is conv_8
- conv_8 is upsampled 16 times to match the input image size, using a transposed convolution with $(kernel=32, stride=16, padding=same)$. At the end, the number of channels equals the number of classes (see the sketch after this list)
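A compact sketch of the FCN-16 fusion step, under the same assumptions (PyTorch, 21 classes, $224\times224$ input, illustrative names); pool_4 is assumed to be mapped to class scores by a $1\times1$ convolution before the addition, as in the paper:

```python
import torch
import torch.nn as nn

C = 21  # number of classes (illustrative)

up2 = nn.ConvTranspose2d(C, C, kernel_size=4, stride=2, padding=1)     # 2x
up16 = nn.ConvTranspose2d(C, C, kernel_size=32, stride=16, padding=8)  # 16x

score_conv7 = torch.randn(1, C, 7, 7)    # conv_7 scored to C class channels
score_pool4 = torch.randn(1, C, 14, 14)  # pool_4 scored by a 1x1 convolution

conv8 = up2(score_conv7) + score_pool4   # skip connection at 1/16 resolution
mask = up16(conv8)                       # (1, 21, 224, 224)
```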
FCN-8:
- The last fully connected layer (layer_7) is replaced by a $1\times1$ convolution layer conv_7 (see the conversion above)
- The new layer (conv_7) is upsampled 2 times to match the dimensions of pool_4, using a transposed convolution with $(kernel=4, stride=2, padding=same)$. A skip connection between pool_4 and the upsampled conv_7 is added; the sum is conv_8
- conv_8 is upsampled 2 times to match the dimensions of pool_3, using a transposed convolution with $(kernel=4, stride=2, padding=same)$. A skip connection between pool_3 and the upsampled conv_8 is added; the sum is conv_9
- conv_9 is upsampled 8 times to match the input image size, using a transposed convolution with $(kernel=16, stride=8, padding=same)$. At the end, the number of channels equals the number of classes (see the sketch after this list)
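Putting it together, a minimal sketch of the full FCN-8 fusion path under the same assumptions ($224\times224$ input, so pool_3 is $28\times28$, pool_4 is $14\times14$, conv_7 is $7\times7$; 21 classes; all names illustrative). FCN-16 is this with the pool_3 branch removed:

```python
import torch
import torch.nn as nn

C = 21  # number of classes (illustrative)

# 1x1 convolutions score each feature map in class space before fusing.
score7 = nn.Conv2d(4096, C, kernel_size=1)
score4 = nn.Conv2d(512, C, kernel_size=1)
score3 = nn.Conv2d(256, C, kernel_size=1)

# Two 2x upsamplings and a final 8x upsampling (kernel = 2*stride throughout).
up2_a = nn.ConvTranspose2d(C, C, kernel_size=4, stride=2, padding=1)
up2_b = nn.ConvTranspose2d(C, C, kernel_size=4, stride=2, padding=1)
up8 = nn.ConvTranspose2d(C, C, kernel_size=16, stride=8, padding=4)

pool3 = torch.randn(1, 256, 28, 28)  # stand-ins for VGG16 activations
pool4 = torch.randn(1, 512, 14, 14)
conv7 = torch.randn(1, 4096, 7, 7)

conv8 = up2_a(score7(conv7)) + score4(pool4)  # fuse at 1/16 resolution
conv9 = up2_b(conv8) + score3(pool3)          # fuse at 1/8 resolution
mask = up8(conv9)                             # (1, 21, 224, 224) scores
```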