Google intern's new algorithm delivers an amazing speedup! BERT training cut from three days and three nights to just over an hour
Guo Yipu, Annie | from Aofei Temple
Quantum Bit Report | Public Account QbitAI
How long does it take to train the most powerful language AI from scratch? That time has now been cut from three days and three nights to just over an hour!
The person behind this advance is Google Brain intern You Yang. Originally from Henan, China, he ranked first in Tsinghua University's computer science master's program and is now a doctoral student at the University of California, Berkeley.
A recent study he completed has increased the BERT pre-training speed by 64 times, from 4860 minutes to 76 minutes and 11 seconds.
After training, the model was tested on the machine question answering dataset SQuAD-v1, where its F1 score came out slightly higher than that of the original three-day version.
So what magical skill did this enviable intern use?
From time-consuming to time-saving
To shorten neural-network training time, there are well-established methods, a two-pronged approach: first, add large numbers of CPUs, GPUs or TPUs to increase computing power; second, increase the batch size to reduce the number of iterations.
This method is commonly used in computer vision research. A few days ago, researchers from Fujitsu used this method to train ResNet-50 on ImageNet in 74.7 seconds.
However, these training methods from the vision field fail when applied directly to BERT. BERT is currently the most time-consuming model in the industry to train, and its computational cost far exceeds that of ImageNet models.
In addition, large-scale training has a common problem: it produces a generalization gap, reducing the network's ability to generalize, so this kind of naive scaling often lowers accuracy on the test set.
So what to do?
To train BERT with large batches, You Yang and his colleagues proposed the LAMB optimizer, a general-purpose neural-network optimizer that works for both large-batch and small-batch training, with no hyperparameters to tune other than the learning rate.
Relying on LAMB, which remains stable even at ultra-large batches, they expanded the batch size from 512 all the way to 65536.
What does 65536 mean? This has reached the limit of TPU memory, and it is also the first time that a study has used a large batch size of more than 2000 to train BERT.
As a result, the number of iterations drops dramatically. Previously, pre-training the BERT-Large model took 1,000,000 iterations and 81.4 hours. With LAMB and large batches, only 8,599 iterations are needed, and pre-training time shrinks to 76 minutes.
That is a 64x speedup!
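As a quick back-of-the-envelope check on the figures quoted above (using the article's own numbers):

```python
# Figures quoted in the article
baseline_minutes = 4860        # roughly three days and three nights
lamb_minutes = 76              # pre-training time with LAMB
baseline_iters = 1_000_000     # BERT-Large iterations before
lamb_iters = 8_599             # iterations at batch size 32768/65536

print(f"wall-clock speedup: {baseline_minutes / lamb_minutes:.1f}x")  # ~64x
print(f"iteration reduction: {baseline_iters / lamb_iters:.1f}x")     # ~116x
```

Note that the iteration count falls by far more than 64x; the wall-clock speedup is smaller because each large-batch iteration costs more compute.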
So what kind of magic weapon is this LAMB?
LAMB Optimizer
Its full name is Layer-wise Adaptive Moments optimizer for Batch training. Like the familiar SGD and Adam, it is an optimizer for training machine-learning models.
Originally, the three-day and three-night BERT training used the Adam optimizer with weight decay.
The new LAMB optimizer grew out of earlier work by the paper's first author, You Yang: in 2017 he proposed LARS, an optimizer for large-batch training of convolutional neural networks.
LARS uses a coefficient, eeta, to scale the trust ratio, but this can cause problems and introduce divergence on some layers. In LAMB, therefore, the research team removed eeta and set the trust ratio directly to 1.0 for layers where ‖w‖ = 0 or ‖g‖ = 0, eliminating the divergence seen in BERT training.
In addition, LARS folds weight decay into the trust-ratio calculation itself, whereas LAMB accounts for weight decay inside the update, which changes the trust-ratio formula accordingly.
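For reference, the two trust ratios can be written as follows (notation following the LARS and LAMB papers; here $\eta$ is the LARS trust coefficient, $\beta$ and $\lambda$ are weight-decay coefficients, and $\phi$ is a scaling function):

```latex
% LARS: layer-wise learning rate for layer l, with weight decay
% in the denominator
\lambda^{(l)} = \eta \cdot
  \frac{\lVert w^{(l)} \rVert}
       {\lVert \nabla L(w^{(l)}) \rVert + \beta \lVert w^{(l)} \rVert}

% LAMB: weight decay enters the Adam-style update r_t itself
r_t = \frac{m_t}{\sqrt{v_t} + \epsilon}, \qquad
\text{trust ratio}^{(l)} =
  \frac{\phi\!\left(\lVert w^{(l)} \rVert\right)}
       {\lVert r_t^{(l)} + \lambda w^{(l)} \rVert}
```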
In addition, although LARS runs well on ImageNet, ImageNet models have far fewer parameters than BERT, so the research team also adjusted the norm used in LAMB's trust-ratio formula.
Through a series of changes, the maximum batch size of the LAMB optimizer has been increased to 32K.
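To make the idea concrete, here is a minimal single-layer LAMB step in Python. This is an illustrative sketch based on the description above and the LAMB paper, not the authors' TPU implementation; the hyperparameter defaults are placeholders.

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB update for a single layer's weights w with gradient g.

    Moments m, v are updated as in Adam; the resulting step is then
    rescaled per layer by the trust ratio ||w|| / ||update||.
    """
    m = beta1 * m + (1 - beta1) * g            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g        # second moment
    m_hat = m / (1 - beta1 ** t)               # bias correction, t >= 1
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    # Trust ratio; fall back to 1.0 when either norm is zero,
    # matching the fix described above for ||w|| = 0 or ||g|| = 0 layers.
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust * update, m, v
```

Because each layer's step is scaled by ‖w‖/‖update‖, layers with larger weights take proportionally larger steps, which is what keeps training stable at very large batch sizes.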
True Optimization
Whether it's a mule or a horse, you have to take it out for a walk to find out.
The researchers tested the LAMB optimizer with both regular training and mixed-batch training, and the results held up well.
For the tests, they scaled up the hardware considerably, training on a 1024-core TPUv3 Pod; those 1024 TPU cores deliver over 100 petaflops of mixed-precision compute.
As with the original BERT, the researchers pre-trained the model on the Wikipedia and BooksCorpus datasets, then evaluated it on Stanford's SQuAD-v1 dataset, using the F1 score as the accuracy metric.
The results show that as the batch size grows and the number of iterations falls, the F1 score barely fluctuates, staying above 90, while training time drops sharply.
△ Test results
Once the batch size exceeds 65536 (at sequence length 128), training time no longer drops significantly.
With 1024 TPUs, a batch size of 32768 or 65536, and 8,599 iterations, training time bottoms out: the pre-training process takes just 76.19 minutes.
On top of that, the setup achieved a weak scaling efficiency of 101.8%.
The first author's background: top of his class in both undergraduate and graduate school
The author of this study is You Yang, a doctoral student in the Department of Computer Science at UC Berkeley and an intern at Google Brain.
You Yang has been a top student throughout. He majored in computer science at China Agricultural University, graduating first in his class, and was then admitted to Tsinghua University's Department of Computer Science for his master's degree, where he again ranked first among the 134 outstanding students admitted.
When applying for his doctorate, You Yang received full-scholarship offers from six prestigious universities: UC Berkeley, CMU, the University of Chicago, UIUC, Georgia Tech, and Northwestern. He had his pick.
From those six, he chose UC Berkeley, which happens to sit in the Bay Area. That gave You Yang the chance to intern at well-known companies and labs such as Google Brain, Intel Labs, Microsoft Research, NVIDIA, and IBM Watson Research Center. Through those internships he contributed to major open-source projects, including TensorFlow and the NVIDIA-GPU and Intel-CPU ports of Caffe.
He even got to attend a party at Jensen Huang's (Huang Renxun's) house during an internship. Truly enviable.
△ I didn’t wear a leather jacket today
In addition, You Yang is a prolific author, with more than a dozen first-author papers at top conferences, including the best paper at last year's ICPP and the best paper at IPDPS 2015. He also won the Siebel Scholar award in 2014.
Portal
Reducing BERT Pre-Training Time from 3 Days to 76 Minutes
Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, Cho-Jui Hsieh
https://arxiv.org/abs/1904.00962
-over-
Quantum Bit QbitAI · Toutiao signed author
Tracking new trends in AI technology and products