
An Interesting Idea toward CNN — Residual


Introduction

In 2015, the ILSVRC champion was the team led by Kaiming He, the well-known researcher at Microsoft Research. From then on, the concept of the residual became more and more famous, and the depth of networks grew quickly. In this article, I want to tell the story of the argument between the usual CNN and the residual network.

The number of layers used by the ILSVRC champions over the years

ResNet — The Winner of ILSVRC 2015

ResNet is short for “residual network”. Over years of experiments, Kaiming He [1] observed a phenomenon: in theory, the deeper the network, the more accurate the result should be. However, when he added more layers to a traditional CNN model, the training error and the testing error both increased! The result is shown in the following figure.

The training error and testing error reported in the ResNet paper

He gave a guess about this result: it is probably not over-fitting. On the contrary, it is a disadvantage of the deep network itself: once you add more layers, the network cannot learn well across the whole set of parameters. He named this result the degradation problem. After this experiment, the residual network was created.

The concepts of the plain network and the residual network

A fundamental question is: what is the residual? In machine learning terms, we want to learn a hypothesis h(x) such that h(x) approximates the target function F(x). In the plain network, h(x) is a direct mapping from the feature domain to the target domain, as shown on the left side of the image above.

However, the residual is different. In a residual network, the model learns a hypothesis h(x) such that h(x) approximates the difference between the target function F(x) and the input x. In other words, we want to find a hypothesis that satisfies h(x) + x = F(x). Here h(x) is the gap between the feature domain and the target domain.
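
To make the idea concrete, below is a minimal sketch of such a residual block, written in PyTorch as my own illustration rather than the exact block from the paper; the 3x3 convolutions, the batch normalization, and the fixed channel count are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Learn the residual h(x) and add the identity x back: output = h(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        h = self.relu(self.bn1(self.conv1(x)))  # first half of the residual branch h(x)
        h = self.bn2(self.conv2(h))             # second half, activation deferred
        return self.relu(h + x)                 # skip connection: h(x) + x

# usage: the block keeps the input shape
x = torch.randn(1, 16, 32, 32)
y = ResidualBlock(16)(x)                        # shape stays (1, 16, 32, 32)
```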

The gradient derivation for the plain network and the residual network

Another question is: can the residual network be trained by back-propagation? In this discussion, we treat each layer as a simple fully-connected layer; you can extend the argument to other types of layers. The left part of the image above shows my derivation for the stacked traditional network.

However, even though we add the input term to the output computation, the gradient function does not change much. Most importantly, it does not generate extra terms that are troublesome to handle. Simply put, the residual network can still be learned by stochastic gradient descent!
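
As a sketch of that argument in my own notation, consider a single residual block with output y = h(x) + x and loss L:

```latex
y = h(x) + x
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x}
  = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x}
  = \frac{\partial L}{\partial y}\left(\frac{\partial h(x)}{\partial x} + 1\right)
```

The identity path only contributes the constant term 1, so the chain rule applies unchanged and the gradient always has a direct route back through the skip connection.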

The training error of the plain network and the residual network

With this change, the image above illustrates the experimental results. The left side shows the convergence of the plain network; the right side shows the convergence of the residual network. As you can see, while the deeper plain network gets a higher error rate, the deeper residual network overcomes the degradation problem.

RiR — The Link between Residual and Feature

After the success of ResNet, other groups asked: why not combine the concepts of the feature and the residual? The traditional CNN learns abstract features of the image, while the residual CNN learns the residual differences hidden behind the image. Can we combine both advantages to get better performance?

The structures of the usual CNN, ResNet, and RiR

As a result, ResNet in ResNet (RiR) was invented [2]. The image above shows three different CNN structures. The left one is the usual CNN, which only learns the abstract feature representation. The middle one is ResNet, which has two skip connections.

The right one is the structure of RiR. As you can see, the network has two generalized residual blocks, and the whole computation is divided into two paths: a residual stream and a transient stream. The two streams are merged after the blocks, and the dimension is reduced by global average pooling.

The structure of the generalized residual block

The image above is taken from the original paper. In each stream, there are two different convolution layers: one is the main layer and the other is auxiliary. The main result is added element-wise to the auxiliary result of the other stream, and the sum is passed to the next block.
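
Below is a minimal sketch of how such a generalized residual block could look, based on my reading of the description above and of [2]; the PyTorch framing, the equal channel counts, and the omission of batch normalization are assumptions.

```python
import torch
import torch.nn as nn

class GeneralizedResidualBlock(nn.Module):
    """Two parallel streams: a residual stream r (with an identity shortcut)
    and a transient stream t (without one). Each stream has its own main conv
    plus an auxiliary conv that feeds the other stream."""
    def __init__(self, channels):
        super().__init__()
        self.conv_rr = nn.Conv2d(channels, channels, 3, padding=1)  # residual -> residual (main)
        self.conv_tr = nn.Conv2d(channels, channels, 3, padding=1)  # transient -> residual (auxiliary)
        self.conv_tt = nn.Conv2d(channels, channels, 3, padding=1)  # transient -> transient (main)
        self.conv_rt = nn.Conv2d(channels, channels, 3, padding=1)  # residual -> transient (auxiliary)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, r, t):
        r_next = self.relu(self.conv_rr(r) + self.conv_tr(t) + r)  # element-wise sum + shortcut
        t_next = self.relu(self.conv_tt(t) + self.conv_rt(r))      # no shortcut on the transient stream
        return r_next, t_next

# usage: in this sketch both streams start from the same feature map
x = torch.randn(1, 16, 32, 32)
r, t = GeneralizedResidualBlock(16)(x, x)
```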

The training convergence on CIFAR-10

I also implemented RiR and evaluated the models on the CIFAR-10 and MNIST datasets. The image above shows the result on CIFAR-10. In this experiment, the model is trained for 4 epochs, each selecting 400 bagged images at random. RiR really does combine both advantages and gets the best performance.

The training convergence on MNIST

However, the result isn’t great in the other experiment. The above image illustrates the result toward MNIST dataset. In this experiment, the model is trained only in 2 epochs which still selects 400 bagging image randomly. The reason I guess is that the RiR model cannot do very well while the feature domain isn’t very complex. The other probable reason is that the it’s lack of training.

FractalNet — A Plain Network Made Wide

Some people questioned the residual concept: is the residual really that powerful? Why can't the feature concept do just as well as the residual? Gustav Larsson [3] gave the criticism below and, as a result, proposed the FractalNet structure [3].

Residual representations may not be fundamental to the success of extremely deep convolutional neural networks. Rather, the key may be the ability to transition, during training, from effectively shallow to deep.

The main idea of FractalNet is that the convolutional structure should be designed in a more sophisticated way. The top-left area of the following image shows how the compute path is expanded fractally. In each compute column, the result of the stacked sub-path is joined with the output of an extra convolution.

The structure of FractalNet

The right part of the image above shows the structure of the fractal block and the whole model. Inside a fractal block, the feature map is computed by stacked sub-networks of different depths. A pooling layer follows each fractal block.

One concept should be noticed: the join operation works on each channel of the feature map. For each channel, the layer computes the element-wise average over the input tensors. As a result, the shape of the output tensor does not change.
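
Below is a minimal sketch of the fractal expansion rule together with the mean join, as my own illustration of the idea in [3]; drop-path and batch normalization are omitted, and the class names and conv-unit layout are assumptions.

```python
import torch
import torch.nn as nn

class ConvUnit(nn.Module):
    """The basic unit being expanded fractally: one conv + ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x))

class FractalBlock(nn.Module):
    """f_1(x) = conv(x);  f_C(x) = join(f_{C-1}(f_{C-1}(x)), conv(x)).
    The join is an element-wise mean, so the tensor shape never changes."""
    def __init__(self, channels, columns):
        super().__init__()
        self.columns = columns
        self.shallow = ConvUnit(channels)                      # the single-conv path
        if columns > 1:
            self.deep_a = FractalBlock(channels, columns - 1)  # two stacked copies of
            self.deep_b = FractalBlock(channels, columns - 1)  # the previous fractal

    def forward(self, x):
        if self.columns == 1:
            return self.shallow(x)
        paths = [self.shallow(x), self.deep_b(self.deep_a(x))]
        return torch.stack(paths, dim=0).mean(dim=0)           # join by element-wise mean

# usage: a 3-column fractal block keeps the input shape
y = FractalBlock(16, columns=3)(torch.randn(1, 16, 32, 32))
```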

Examples of local and global drop-path

Another creative idea announced in this paper is drop-path. There are two types of drop-path: local and global.

  1. Local drop-path: each join layer randomly disables some of its inputs, but it guarantees that at least one input tensor is kept.
  2. Global drop-path: the whole model picks a single column at random and disables all the other compute paths (a sketch of both modes follows this list).
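
Below is a minimal sketch of how a join layer could behave under both modes; it is my own illustration, the drop probability is only a placeholder, and the paper makes its drop-path decisions per mini-batch rather than per call as done here.

```python
import random
import torch

def join(paths, drop_prob=0.15, global_column=None):
    """Element-wise mean join over a list of same-shaped tensors.

    Local drop-path: each input may be dropped with probability drop_prob,
    but at least one input always survives.
    Global drop-path: if global_column is given, only that column's path is kept.
    """
    if global_column is not None:                 # global mode: a single fixed column
        return paths[global_column]
    keep = [p for p in paths if random.random() > drop_prob]
    if not keep:                                  # guarantee at least one survivor
        keep = [random.choice(paths)]
    return torch.stack(keep, dim=0).mean(dim=0)   # the mean keeps the tensor shape
```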

By this design, the model gets two benefits:

  1. Drop-path is similar to dropout. Its regularization effect prevents the different paths from co-adapting.
  2. Since the effective depth of the model becomes flexible, the training process can be regarded as a stochastic-depth mechanism, which is also a kind of regularization.

The results of FractalNet on CIFAR-100

The table above shows the results on CIFAR-100. As you can see, the plain network gets around a 39% error rate, while FractalNet improves a lot and reduces it to around 27%. Although FractalNet cannot beat the performance of ResNet, the improvement is still a good result.

Conclusion

The residual concept has become more and more popular. Performance can rise without adding extra parameters. On the other hand, widening the compute paths might be a good way to enhance performance as well.

The link to my RiR implementation is here. You can examine the residual concept in more detail there.

Reference

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv:1512.03385 [cs.CV], Dec. 2015.

[2] S. Targ, D. Almeida, and K. Lyman, “Resnet in Resnet: Generalizing Residual Architectures,” arXiv:1603.08029 [cs.LG], Mar. 2016.

[3] G. Larsson, M. Maire, and G. Shakhnarovich, “FractalNet: Ultra-Deep Neural Networks without Residuals,” arXiv:1605.07648 [cs.CV], May 2016.

