As network depth increases, accuracy gets saturated, and this is not a problem of overfitting, as it might first appear. In simple words, if we add some additional layers to a pretrained network, the new, bigger network should work at least as well as the shallower one, or even better, since the extra layers could in principle simply learn the identity mapping. ResNets address this problem, namely the vanishing gradient problem.
Let us focus on a local part of a neural network (a few layers) with desired mapping $H(x)$. We let these layers fit another mapping, $F(x) = H(x) - x$, so the original mapping is recast as $F(x) + x$.
The authors' hypothesis is that it is easier to optimize this residual mapping $F(x)$ than to optimize the original, unreferenced mapping $H(x)$.
Formally, a building block (or residual unit) performs the following computation:
\[
\begin{split}
&y_l = x_l + F(x_l, W_l) \\
&x_{l + 1} = f(y_l)
\end{split}
\tag{1}\label{eq1}
\]
where $f$ is the activation function and $W_l$ is the weight matrix of layer $l$.
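As an illustration, here is a minimal sketch of such a residual unit, assuming PyTorch; the two-convolution residual branch, the batch normalization, and the channel count are illustrative choices, not anything prescribed by Eq. (1).

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Basic residual unit computing x_{l+1} = f(x_l + F(x_l, W_l)), Eq. (1)."""

    def __init__(self, channels: int):
        super().__init__()
        # Residual branch F(x_l, W_l): two 3x3 convolutions with batch norm
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.f = nn.ReLU(inplace=True)  # activation applied after the addition

    def forward(self, x):
        y = x + self.residual(x)  # y_l = x_l + F(x_l, W_l)
        return self.f(y)          # x_{l+1} = f(y_l)

x = torch.randn(1, 64, 56, 56)
print(ResidualUnit(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```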
We see that the dimensions of $x_l$ and $F$ must be equal. If this is not the case, we apply a linear projection $A$ to the shortcut, which can be trainable (e.g. a $1 \times 1$ convolution), or we simply zero-pad $x_l$ to the required dimension:
\[ y_l = Ax_l + F(x_l, W_l) \tag{2} \]
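A sketch of the projection variant, again assuming PyTorch: here $A$ is a trainable $1 \times 1$ convolution with stride 2, matching a residual branch that halves the spatial resolution and doubles the channel count (the sizes are illustrative).

```python
import torch
import torch.nn as nn

class ProjectionResidualUnit(nn.Module):
    """Residual unit with a shortcut projection A when the shapes of x_l and F(x_l) differ."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        # Residual branch changes both spatial size and channel count
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Trainable linear projection A: 1x1 convolution matching the shape of F(x_l)
        self.project = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)
        self.f = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.f(self.project(x) + self.residual(x))  # y_l = A x_l + F(x_l, W_l)

x = torch.randn(1, 64, 56, 56)
print(ProjectionResidualUnit(64, 128)(x).shape)  # torch.Size([1, 128, 28, 28])
```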
The authors considered a more general form of the first equation: \[ y_l = h(x_l) + F(x_l, W_l) \tag{3} \] and showed by experiments that the identity mapping $h(x_l) = x_l$ is sufficient for addressing this problem.
Let us also take the activation function $f$ to be the identity mapping, $f(y_l) = y_l$ (as in [2]); then \[ x_{l+1} = x_l + F(x_l, W_l) \tag{4} \]
Applying this recursively between any deeper unit $L$ and any shallower unit $l$ gives: \[ x_{L} = x_l + \sum_{i=l}^{L-1} F(x_i, W_i) \tag{5} \]
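For example, unrolling the recursion over just two units makes the telescoping sum explicit:
\[
x_{l+2} = x_{l+1} + F(x_{l+1}, W_{l+1}) = x_l + F(x_l, W_l) + F(x_{l+1}, W_{l+1})
\]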
Denoting the loss function as $C$, from the chain rule of backpropagation we get: \[ \frac{\partial C}{\partial x_l} = \frac{\partial C}{\partial x_L} \frac{\partial x_L}{\partial x_l} = \frac{\partial C}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right) \tag{6} \] We see that the term $\frac{\partial C}{\partial x_L}$ ensures that information is directly propagated back to any shallower unit $l$. It is also unlikely that the gradient $\frac{\partial C}{\partial x_l}$ is canceled out for a minibatch, because the term $\frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i)$ cannot always be $-1$ for all samples in the minibatch [2].
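A toy numerical illustration of this effect, assuming PyTorch; the tanh branches, the depth of 50, and the weight scale are arbitrary choices made only to contrast a plain stack with a stack of identity-shortcut units.

```python
import torch

# Toy comparison: gradient magnitude at the input of a 50-layer plain
# stack vs. a 50-layer stack of identity-shortcut units. The skip
# connections keep dC/dx_l from vanishing, thanks to the identity term
# in the gradient above.
torch.manual_seed(0)
depth, dim = 50, 16
weights = [torch.randn(dim, dim) * 0.05 for _ in range(depth)]

def input_grad_norm(residual: bool) -> float:
    x0 = torch.randn(dim, requires_grad=True)
    x = x0
    for W in weights:
        branch = torch.tanh(x @ W)               # F(x_i, W_i)
        x = (x + branch) if residual else branch  # with / without the skip connection
    (x ** 2).sum().backward()                    # a toy loss C
    return x0.grad.norm().item()

print("plain stack:   ", input_grad_norm(residual=False))  # essentially zero
print("residual stack:", input_grad_norm(residual=True))   # far from zero
```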