
Kaiming He "ends" the era of ImageNet pre-training: neural networks trained from scratch match the COCO champion

Latest update: 2018-11-23
Xia Yi and Annie, reporting from Aofei Temple
Produced by Quantum Bit | Public Account QbitAI

Kaiming He, RBG, Piotr Dollár.

The three longtime collaborators behind Mask R-CNN have teamed up again and, with a single paper, "ended" the era of ImageNet pre-training.

What they are targeting is a routine practice in today's computer vision research: whatever the task, start from an ImageNet pre-trained model and do transfer learning.

But is pre-training really necessary?

Their paper, Rethinking ImageNet Pre-training, gives their answer.

Three researchers from FAIR (Facebook AI Research) trained neural networks from random initialization and evaluated them on COCO object detection and instance segmentation. The results were in no way inferior to models pre-trained on ImageNet.

It can even compete with the COCO 2017 champion without pre-training or external data.

Results

The training curves tell the story.

They trained a Mask R-CNN model using the 2017 version of the COCO training set, and the backbone network was a ResNet-50 FPN with group normalization (GroupNorm).


They then evaluated bounding-box average precision (AP) on the corresponding validation set for two schemes: random weight initialization (purple curve) and ImageNet pre-training followed by fine-tuning (gray curve).

Random initialization lags behind pre-training at first, but as the number of iterations grows it gradually reaches comparable results.
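For concreteness, here is a minimal sketch of the two initialization schemes using torchvision's detection models. This is not the authors' code: their model additionally uses GroupNorm in the backbone and heads, which torchvision's stock Mask R-CNN does not.

```python
# A minimal sketch (not the authors' code) of the two initialization schemes,
# using torchvision's Mask R-CNN.
from torchvision.models import ResNet50_Weights
from torchvision.models.detection import maskrcnn_resnet50_fpn

# (a) Random initialization: no ImageNet weights anywhere.
model_scratch = maskrcnn_resnet50_fpn(weights=None, weights_backbone=None)

# (b) ImageNet pre-training + fine-tuning: the backbone starts from ImageNet
#     weights, while the detection/segmentation heads are still random.
model_finetune = maskrcnn_resnet50_fpn(
    weights=None,
    weights_backbone=ResNet50_Weights.IMAGENET1K_V1,
)
```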

To explore different training schemes, the authors also dropped the learning rate at different iteration counts, i.e., tried schedules of different lengths.

The results show that models trained from random initialization need more iterations to converge, but their final accuracy is no worse than that of the pre-trained and fine-tuned models.
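As a rough illustration of "train longer, drop the learning rate later", here is a hedged sketch of stretching a multi-step learning-rate schedule. The 90k-iteration "1x" schedule with drops at 60k/80k is the conventional Detectron-style default, and the helper below is purely illustrative, not the authors' training code.

```python
# A sketch of stretching a Detectron-style multi-step LR schedule.
import torch

def stretched_schedule(optimizer, multiplier):
    # Base "1x" schedule: 90k iterations, LR divided by 10 at 60k and 80k.
    base_iters, base_milestones = 90_000, [60_000, 80_000]
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer,
        milestones=[m * multiplier for m in base_milestones],
        gamma=0.1,
    )
    return scheduler, base_iters * multiplier

model = torch.nn.Conv2d(3, 8, 3)        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.02,
                            momentum=0.9, weight_decay=1e-4)

# Fine-tuning might use a ~2x schedule; training from scratch needs a longer
# one, e.g. ~6x, before it catches up.
scheduler, total_iters = stretched_schedule(optimizer, multiplier=6)
```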

With the backbone swapped to ResNet-101 FPN, training from scratch shows the same trend: its AP starts out behind the pre-trained model, but after enough iterations the two end up on par.

How good is the result? As noted above: on par with the COCO 2017 champion.

The from-scratch model's performance is demonstrated on COCO object detection: on the 2017 validation set it reaches a bbox (bounding box) AP of 50.9 and a mask (instance segmentation) AP of 43.2;

They also submitted this model in the 2018 competition, with bbox and mask AP of 51.3 and 43.6 respectively.

This result is the best among single models that have not been pre-trained on ImageNet.

This is a very large model, with a ResNeXt-152 8×32d backbone and GN normalization; the result also shows that this large model exhibits no obvious overfitting and is quite robust.

In their experiments, the authors also pre-trained the same model on ImageNet and then fine-tuned it, but this brought no improvement.

This robustness has other manifestations as well.

For example, even with less data, training from scratch remains comparable to pre-training plus fine-tuning, a result the paper describes as "even more surprising".

When they shrank the training set to a third of COCO (35,000 images) or even a tenth (10,000 images), random initialization, given enough iterations, still seemed to come out slightly ahead of the pre-training approach.

However, 10,000 images is already the limit, and further reducing the amount of data will not work. When they reduced the training data to 1,000 images, obvious overfitting occurred.
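A small, hypothetical sketch of how one might carve out such subsets when replicating this experiment. Here `coco_train` stands for a CocoDetection-style training dataset built elsewhere; the helper and variable names are illustrative, not from the paper.

```python
# Hypothetical sketch: random 35k/10k/1k-image subsets of a COCO-style dataset.
import torch
from torch.utils.data import Subset

def random_subset(dataset, num_images, seed=0):
    # Pick `num_images` indices at random with a fixed seed for repeatability.
    g = torch.Generator().manual_seed(seed)
    indices = torch.randperm(len(dataset), generator=g)[:num_images].tolist()
    return Subset(dataset, indices)

# coco_35k = random_subset(coco_train, 35_000)
# coco_10k = random_subset(coco_train, 10_000)
# coco_1k  = random_subset(coco_train, 1_000)   # expect obvious overfitting
```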

How to train?

Abandoning ImageNet pre-training does not require inventing a new architecture, but two small changes are unavoidable.

The first is the model's normalization method; the second is the length of training.

Let's talk about model normalization first.

Because the input images for object detection are usually high-resolution, the batch size cannot be made very large, so Batch Normalization (BN) is ill-suited to training detection models from scratch.

He and his co-authors therefore turned to two workable alternatives from recent research: Group Normalization (GN) and Synchronized Batch Normalization (SyncBN).

GN was proposed by Yuxin Wu and Kaiming He and published at ECCV 2018, where it was a best-paper nominee. This normalization method divides the channels into groups and computes the mean and variance within each group; because the computation is independent of the batch dimension, its accuracy is unaffected by the batch size.
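A minimal PyTorch sketch of the difference: GN's statistics are computed per sample over groups of channels, so it behaves the same at batch size 1, whereas BN's statistics depend on the batch.

```python
# GroupNorm vs BatchNorm at batch size 1 (illustrative only).
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)                      # batch of 1, 64 channels

gn = nn.GroupNorm(num_groups=32, num_channels=64)   # 32 groups of 2 channels
bn = nn.BatchNorm2d(num_features=64)

y_gn = gn(x)   # per-sample, per-group statistics: unaffected by batch size
y_bn = bn(x)   # batch statistics from a single image: noisy and unreliable
```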

SyncBN comes from Megvii's MegDet and from the CVPR 2018 paper Path Aggregation Network for Instance Segmentation by Shu Liu et al. of the Chinese University of Hong Kong. It implements BN by computing batch statistics across GPUs, which increases the effective batch size when multiple GPUs are used.
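In PyTorch, a model's BN layers can be converted to the built-in synchronized variant as sketched below. This uses PyTorch's own SyncBatchNorm as a stand-in for the MegDet/PANet implementations, and it assumes torch.distributed has been initialized and the model is wrapped in DistributedDataParallel.

```python
# Converting BatchNorm layers to PyTorch's built-in SyncBatchNorm
# (a stand-in here for the MegDet / PANet implementations).
import torch

def to_sync_bn(model: torch.nn.Module) -> torch.nn.Module:
    # Every BatchNorm*d layer is swapped for SyncBatchNorm, so batch
    # statistics are computed across all participating GPUs.
    return torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# model = to_sync_bn(model).cuda()
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```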

Once the normalization method is chosen, convergence also needs attention. Put simply, the model has to be trained for more epochs.

The reason is simple: you can't expect a model to converge as fast as a pre-trained model when trained from a randomly initialized state.

So, be patient and train for a while.

The figure in the paper compares what each scheme actually sees during training: assuming ImageNet pre-training runs for 100 epochs, a model trained from scratch needs roughly three times the fine-tuning schedule on COCO to see a comparable number of pixels, yet it still sees far fewer instance-level and image-level samples.

In other words, training from random initialization requires seeing a large number of samples.
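A back-of-the-envelope check of that claim, using rough, illustrative numbers rather than the paper's exact figures: the ImageNet-1k training-set size, 224×224 crops, roughly 800×1333 COCO training resolution, and a Detectron-style 90k-iteration "1x" schedule with batch size 16.

```python
# Back-of-the-envelope comparison of images and pixels seen during training
# (illustrative numbers, not the paper's exact figures).
IMAGENET_IMAGES = 1_281_167       # ImageNet-1k training set
IMAGENET_EPOCHS = 100             # assumed pre-training length
IMAGENET_PIXELS = 224 * 224       # standard training crop

COCO_PIXELS = 800 * 1333          # typical detection training resolution
ITERS_1X, BATCH = 90_000, 16      # Detectron-style "1x" schedule

def coco_images_seen(multiplier):
    return ITERS_1X * multiplier * BATCH

# Pre-train on ImageNet, then fine-tune on COCO with a ~2x schedule.
pretrain_images = IMAGENET_IMAGES * IMAGENET_EPOCHS + coco_images_seen(2)
pretrain_pixels = (IMAGENET_IMAGES * IMAGENET_EPOCHS * IMAGENET_PIXELS
                   + coco_images_seen(2) * COCO_PIXELS)

# Train from scratch on COCO with a ~6x schedule (3x the fine-tuning schedule).
scratch_images = coco_images_seen(6)
scratch_pixels = coco_images_seen(6) * COCO_PIXELS

print(f"images seen  pre-train+fine-tune: {pretrain_images:.2e}  scratch: {scratch_images:.2e}")
print(f"pixels seen  pre-train+fine-tune: {pretrain_pixels:.2e}  scratch: {scratch_pixels:.2e}")
```

With these illustrative numbers, the pixel counts of the two schemes come out roughly comparable, while the from-scratch run sees more than an order of magnitude fewer image-level samples, which is exactly the gap the comparison highlights.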

Should we use ImageNet pre-training?

The paper also helpfully lists several conclusions drawn from the experiments:

  • It is feasible to train from scratch for a specific task without changing the architecture.

  • Training from scratch requires more iterations to fully converge.

  • In many cases, even with only 10,000 COCO images, training from scratch is just as effective as fine-tuning a pre-trained ImageNet model.

  • Using ImageNet pre-training can accelerate convergence on the target task.

  • ImageNet pre-training may not necessarily reduce overfitting unless the amount of data is extremely small.

  • If the target task is more sensitive to localization than recognition, ImageNet pre-training will be less useful.

Therefore, several key questions about ImageNet pre-training have been answered:

Is it necessary? No. As long as the target dataset and compute are sufficient, you can train directly on the target task. This also suggests that, to improve performance on the target task, collecting and annotating target-domain data is more useful than adding pre-training data.

Does it help? Of course. It can bring significant gains when target-task data is scarce, sidestep some optimization issues on the target data, and shorten the research cycle.

Do we still need big data? Yes. But a generic, large-scale, classification-level pre-training dataset is not required; collecting data in the target domain is more efficient.

Should we still pursue universal representation? Yes, and it is still a laudable goal.

For a deeper understanding of this problem, please read the paper:

Portal

Paper:

Rethinking ImageNet Pre-training
Kaiming He, Ross Girshick, Piotr Dollár
https://arxiv.org/abs/1811.08883

-over-
