2090 views|6 replies

The OP
 

Forwarded article: A Complete Guide to Convolutional Neural Networks (CNN). It is well written.

Reprinted from: https://www.toutiao.com/article/7149130033838604834

【1】Introduction

Let me start with a little off-topic talk...

After I entered graduate school, my supervisor forced me to learn about neural networks. I was completely lost at first. I searched for a lot of material online, and there were endless articles like "Understand Convolutional Neural Networks in One Article", "Teach You to Build Your Own Neural Network Framework in Three Minutes", "Quickly Read the Complete Solution of Neural Networks in Five Minutes", and so on. I read a great many of them, and the result was that after studying for a long time I still couldn't really get started.

Then I slowly understood the true meaning of convolutional neural networks through my own hard work. (So official hahahahahaha)

First of all, the most important thing to be clear is that convolutional neural networks (CNN) have been used in various fields, such as object segmentation, style transfer, automatic coloring, etc. However, CNN can only act as a feature extractor! Therefore, these applications are based on CNN's feature extraction of images.

In this article, I am not going to introduce biological neurons, synapses, etc. like traditional articles introducing CNN. I will start directly with the simplest example.

Without further ado, let’s get started.

When you get a picture and want to identify it, the simplest example is, what is this picture?

For example, I want to train a simplest CNN to recognize whether the letter in a picture is X or O.

When we look at it, it's obviously an X, but the computer doesn't know that; it has no idea what an X is. So we attach a label to the picture, commonly called a Label, here Label = X, to tell the computer that this picture represents X. The computer then remembers what this X looks like.

But not all Xs look like this. For example...

These four are all Xs, but they are obviously different from the previous X. The computer has never seen them before and doesn’t recognize them.

(Here we can bring up the cool-sounding machine-learning term "overfitting": the computer has memorized one exact X and fails to generalize to new ones.)

What should it do when it doesn't recognize them? Naturally, recall whether it has seen something similar before. This is exactly where CNN comes in: it extracts the features of a picture containing an X.

We all know that images are stored in the computer as pixel values, which means that the two Xs actually look like this to the computer.

Where 1 represents white and -1 represents black.
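To make this concrete, here is a minimal NumPy sketch of how such a picture might be stored. The 5×5 grid is my own illustrative layout, not the exact figure from the article:

```python
import numpy as np

# A hypothetical 5x5 "X": 1 = white background, -1 = black stroke
image = np.array([
    [-1,  1,  1,  1, -1],
    [ 1, -1,  1, -1,  1],
    [ 1,  1, -1,  1,  1],
    [ 1, -1,  1, -1,  1],
    [-1,  1,  1,  1, -1],
])
print(image.shape)  # (5, 5)
```

To the computer, the "X" is nothing but this grid of numbers.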

Comparing each pixel one by one is clearly unworkable: the results would be unreliable and the process inefficient, so other matching methods were proposed.

We call this patch matching.

Looking at these two X images, we can see that although the pixel values cannot correspond one to one, there are some common points.

As shown in the picture above, the structures of the three same-colored areas in the two pictures are exactly the same!

Therefore, rather than trying to match every pixel of the two images, can we match them locally?

The answer is of course yes.

It is like locating a face in a photo. A CNN does not know what a face is, so I tell it: a face has three features, and here is what the eyes, nose, and mouth look like. As long as the CNN searches the entire image and finds the places where these three features appear, the face is located.

Similarly, from the standard X image we extract three features:

We found that we can locate a certain part of X by using only these three features.

In a CNN, a feature is also called a convolution kernel (or filter), and is usually 3×3 or 5×5 in size.

【2】Convolution operation

After talking for so long, we finally get to the word convolution!

But!! My friends! The convolution in a convolutional neural network and the convolution operation in signal processing have nothing to do with each other! I even reviewed the convolution operation from advanced mathematics in advance! Damn!

None of that!! Has anything to do with our CNN!!!

(Second-draft revision: a friend pointed out that I am wrong here. Convolutional neural networks are, in essence and in principle, still related to the convolution operation; I simply hadn't learned enough to see the actual connection between the two. Apologies for any misleading information!)

Okay, let's continue with how to calculate. In short: element-wise multiplication.

See the picture below.

Take the value of the (1, 1) element of the feature, take the value of the (1, 1) element inside the blue box on the image, and multiply the two: 1 × 1 = 1. Fill this result, 1, into the new image.

Similarly, continue to calculate the values at the other 8 coordinates.

After all 9 are calculated, it will become like this.

The next step is to average the nine values in the right-hand grid and fill that mean into a new image.

We call this new picture a feature map.
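The single window step described above can be sketched like this. The kernel and patch values are illustrative, not the article's exact figure:

```python
import numpy as np

# A 3x3 diagonal feature (convolution kernel), values in {-1, 1}
kernel = np.array([
    [-1,  1,  1],
    [ 1, -1,  1],
    [ 1,  1, -1],
])

# The 3x3 region of the image currently under the blue window;
# here it happens to match the feature exactly
patch = kernel.copy()

# Element-wise multiply, then average the nine products
value = (patch * kernel).mean()
print(value)  # 1.0 -- a perfect match gives exactly 1
```

Since every product of two matching ±1 entries is 1, the mean of the nine products is exactly 1 for a perfect match.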

Some of you may raise your hands and ask, why is the blue box placed in this position in the picture?

This is just an example. We call this blue box a "window", and the characteristic of a window is that it can slide.

In fact, at the beginning, it should be in the starting position.

After the multiply-and-average operation produces a value, the window starts to slide to the right. How far it slides each time is determined by the stride (step size).

For example, if stride=1, it will shift one pixel to the right.

If stride=2, then shift two pixels to the right.

After moving to the rightmost position, it returns to the left and starts the second row. Similarly, if stride=1, it moves down by one pixel; if stride=2, it moves down by 2 pixels.

OK, after a whole series of sliding, multiplying, and averaging operations, we finally fill in a complete feature map.
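The whole sliding-window pass can be sketched as a short function (valid padding, averaging the products as the article does; the function and variable names are my own):

```python
import numpy as np

def convolve(image, kernel, stride=1):
    """Slide the kernel over the image; at each window position,
    multiply element-wise and store the mean of the products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            fmap[i, j] = (patch * kernel).mean()
    return fmap
```

With a 5×5 image, a 3×3 kernel, and stride = 1, this produces a 3×3 feature map; stride = 2 would produce a 2×2 map.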

The feature map is the "feature" extracted from the original image by each kernel. The closer a value is to 1, the more completely that position matches the feature; the closer it is to -1, the more completely it matches the feature's inverse; and a value close to 0 means that position does not match the feature at all.

A feature acts on an image to produce a feature map. For this image X, we use 3 features, so 3 feature maps are finally produced.

At this point, the convolution operation is finished! ~

【3】Non-linear activation layer

The convolution layer performs multiple convolutions on the original image to produce a set of linear activation responses, while the non-linear activation layer performs a non-linear activation response on the previous result.

This is a very official statement. I wonder if you all feel dizzy after reading the above sentence.

Hmm~ o(* ̄▽ ̄*)o It’s actually not that complicated!

This series of articles adheres to the principle of "speaking human language!" and strives to use the simplest and most popular language to explain those concepts in the book that are difficult to understand.

The most commonly used nonlinear activation function in neural networks is the ReLU function, defined as follows:

f(x) = max(0, x)

That is, values greater than or equal to 0 are retained, and all other values less than 0 are directly rewritten to 0.

Why do we do this? As mentioned above, the closer a value in the feature map is to 1, the more relevant it is to the feature, and the closer it is to -1, the less relevant. When extracting features, to reduce the amount of data and simplify computation, we simply discard the irrelevant values.

As shown in the figure below: values ≥ 0 remain unchanged, while all values < 0 are rewritten as 0.

The result after the nonlinear activation function is obtained:
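ReLU itself is a one-liner; here is a NumPy sketch (the sample feature-map values are illustrative):

```python
import numpy as np

def relu(x):
    # Keep values >= 0 unchanged; rewrite all negative values to 0
    return np.maximum(0, x)

fmap = np.array([[0.77, -0.11],
                 [-0.33, 1.00]])
activated = relu(fmap)  # negatives become 0, the rest pass through
```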

【4】Pooling layer

After the convolution operation, we get feature maps with various values. Although the amount of data is much smaller than in the original image, it is still too large (and deep learning often involves hundreds of thousands of training images). This is where the pooling operation comes in: its main purpose is to reduce the amount of data.

There are two types of pooling: Max Pooling and Average Pooling. As the name implies, Max Pooling takes the maximum value, while Average Pooling takes the average value.

Take max pooling as an example: choose a pooling size of 2×2, slide a 2×2 window over the feature map, take the maximum value within it, and write that value into the new feature map.

Similarly, slide the window to the right according to the step size.

Finally, we get the pooled feature map. It can be clearly seen that the amount of data has been reduced a lot.
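A minimal sketch of 2×2 max pooling, with my own function name and illustrative input values:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Keep only the largest value inside each size x size window."""
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = fmap[r:r + size, c:c + size].max()
    return out

fmap = np.array([[0.33, 0.55, 0.11, 0.77],
                 [0.55, 1.00, 0.11, 0.33],
                 [0.11, 0.11, 1.00, 0.11],
                 [0.77, 0.33, 0.11, 0.55]])
pooled = max_pool(fmap)  # a 4x4 map shrinks to 2x2
```

Each 2×2 block is replaced by its single best match, quartering the amount of data.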

Because max pooling keeps the maximum value in each small block, it effectively keeps the best match within that block (recall that the closer a value is to 1, the better the match). This means it does not care about exactly where in the window the match occurred, only whether a match occurred somewhere. In other words, a CNN can detect whether a feature is present in an image without caring where it is, which also avoids the rigid pixel-by-pixel matching mentioned earlier.

Here we have introduced the basic configuration of CNN - convolutional layer, Relu layer, and pooling layer.

In common CNNs, these three layers can be stacked, with the output of each layer used as the input of the next. For example:
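The stacking described above can be sketched end to end: each layer's output feeds the next. This is a self-contained toy pipeline under my own names and layout assumptions, not the article's exact network:

```python
import numpy as np

def convolve(image, kernel, stride=1):
    # Mean of element-wise products at each window position
    kh, kw = kernel.shape
    h = (image.shape[0] - kh) // stride + 1
    w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * kernel).mean()
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, size=2, stride=2):
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

# conv -> ReLU -> pool, exactly the stacking described above
image = np.random.choice([-1, 1], size=(9, 9))
kernel = np.array([[-1, 1, 1], [1, -1, 1], [1, 1, -1]])
x = convolve(image, kernel)   # 9x9 image -> 7x7 feature map
x = relu(x)                   # negatives -> 0
x = max_pool(x)               # 7x7 -> 3x3 pooled map
print(x.shape)  # (3, 3)
```

Deeper networks simply repeat this conv → ReLU → pool pattern, shrinking the spatial size while keeping the strongest feature responses.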

You can also add more layers to achieve more complex neural networks.

The final fully connected layer, neural network training and optimization will be discussed in the next article.

This post is from Embedded System



2

I read the Complete Guide to Convolutional Neural Networks and found it very powerful.


3

I seem to understand a little more, thank you for sharing such a good post!


4
After reading your explanation, I feel like I understand a lot, thank you!

5

Thank you very much, I understand a lot of things that I didn't understand before


6

This article introduces convolutional neural networks very well, and is worth learning and referring to.


7

A great article: This article introduces convolutional neural networks very well with pictures and texts, which is worth learning and referring to.

