530-billion-parameter NLP model "Megatron-Turing" released: trained on 4,480 A100s, jointly produced by Microsoft and NVIDIA
Fengse from Aofei Temple
Quantum Bit Report | Public Account QbitAI
530 billion parameters! The world's largest NLP model is born.
Launched by Microsoft and NVIDIA, it is called the Megatron-Turing Natural Language Generation model (MT-NLG).
According to the two companies, this scale makes it not only the world's largest but also the most powerful NLP model.
A total of 4,480 NVIDIA A100 GPUs were used in training, ultimately enabling the model to achieve unprecedented accuracy on a range of natural language tasks, including text prediction, reading comprehension, commonsense reasoning, natural language inference, and word sense disambiguation.
Three times the size of GPT-3
This model, referred to as MT-NLG, is the "successor" of both Microsoft's Turing NLG and NVIDIA's Megatron-LM.
Turing NLG was launched by Microsoft in February 2020 with 17 billion parameters; Megatron-LM comes from NVIDIA and was launched in August 2019 with 8.3 billion parameters.
They were the first and second largest Transformer architecture models at the time.
It is well known that language models with more parameters perform better, but they are also challenging to train, for example:
- Even the largest GPU cannot store the parameters of a model this size.
- The large number of computational operations required can result in prohibitively long training times if careful attention is not paid to optimizing the algorithm, software, and hardware stack.
So how does MT-NLG, with three times as many parameters as GPT-3, solve these problems?
The answer is to draw on the strengths of both companies, integrating NVIDIA's most advanced GPU-accelerated training equipment and Microsoft's most advanced distributed learning system to increase training speed.
The two companies also built a corpus with hundreds of billions of tokens and jointly developed training methods to optimize efficiency and stability.
Specifically, a 3D parallelism system was created by combining the GPU parallelism of NVIDIA's Megatron-LM with Microsoft's open-source distributed training framework DeepSpeed.
For the 530-billion-parameter model in this article, each model replica spans 280 NVIDIA A100 GPUs, using Megatron-LM's 8-way tensor slicing within a node and 35-way pipeline parallelism between nodes.
DeepSpeed's data parallelism is then used to scale out further to thousands of GPUs.
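To see how these parallelism degrees combine, here is a quick back-of-the-envelope check in Python; the 8-way, 35-way, and 4,480-GPU figures come from the announcement, while the 16-way data-parallel count is derived arithmetic rather than a number stated explicitly.

```python
# Back-of-the-envelope check of the 3D parallelism layout (figures from the article).
tensor_parallel = 8                      # Megatron-LM tensor slicing within a node
pipeline_parallel = 35                   # pipeline stages across nodes
gpus_per_replica = tensor_parallel * pipeline_parallel
print(gpus_per_replica)                  # 280 GPUs hold one full copy of the model

total_gpus = 4480                        # A100s used for training
data_parallel = total_gpus // gpus_per_replica
print(data_parallel)                     # 16 replicas in data parallel via DeepSpeed (derived, not stated)
```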
Finally, mixed-precision training was carried out on the Selene supercomputer, which is based on the NVIDIA DGX SuperPOD.
(The supercomputer is powered by 560 DGX A100 servers, each with eight NVIDIA A100 80GB Tensor Core GPUs, fully connected to each other via NVLink and NVSwitch).
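As a rough illustration of what mixed-precision training involves, here is a minimal, generic PyTorch sketch using automatic mixed precision (autocast plus a gradient scaler); the model and data are placeholders, and this is not the actual MT-NLG training loop.

```python
import torch

# Placeholder model and optimizer -- stand-ins for illustration, not MT-NLG.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()       # rescales the loss to avoid fp16 gradient underflow

for step in range(10):
    x = torch.randn(8, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in reduced precision where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()          # backward pass on the scaled loss
    scaler.step(optimizer)                 # unscales gradients, then applies the update
    scaler.update()                        # adjusts the scale factor for the next step
```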
The model uses a Transformer decoder architecture with 105 layers, a hidden dimension of 20,480, and 128 attention heads.
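Using the common rule of thumb that a Transformer's non-embedding parameter count is roughly 12 × layers × hidden², these hyperparameters land at about 530 billion; the formula is a standard approximation, not a figure quoted in the article.

```python
layers, hidden = 105, 20480
approx_params = 12 * layers * hidden ** 2          # attention + MLP weights, embeddings ignored
print(f"~{approx_params / 1e9:.0f}B parameters")   # ~528B, consistent with the reported 530B
```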
The datasets used for training include Books3, a plain-text collection of nearly 200,000 books; the question-and-answer website Stack Exchange; Wikipedia; the academic resources PubMed Abstracts and ArXiv; GitHub; and more. These are all high-quality subsets selected from the previously released Pile dataset.
In the end, a total of 270 billion tokens were extracted.
Accuracy test on five tasks
The developers tested the accuracy of MT-NLG on the following five tasks.
- In the text prediction task LAMBADA, the model needs to predict the last word of a given paragraph.
- In the reading comprehension tasks RACE-h and BoolQ, the model needs to generate answers to questions based on a given paragraph.
- In the commonsense reasoning tasks PiQA, HellaSwag, and Winogrande, each task requires the model to have a certain degree of commonsense understanding.
- For natural language inference, the two hard benchmarks ANLI-R2 and HANS test typical failure cases of previous models.
- The word sense disambiguation task WiC requires the model to distinguish the meanings of polysemous words from context.
As a result, the model achieved the highest results in the zero-shot, one-shot, and few-shot settings on the PiQA development set and the LAMBADA test set.
It also achieved the best results on the other tasks.
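For readers unfamiliar with the zero-, one-, and few-shot settings, the sketch below shows how such prompts are typically assembled for a LAMBADA-style last-word prediction task; the passages are invented and the formatting is a common convention, not the exact template used in this evaluation.

```python
def build_prompt(examples, query_passage):
    """Prepend k solved examples (k = 0 for zero-shot) to the passage the model must complete."""
    parts = [f"{passage} {answer}" for passage, answer in examples]
    parts.append(query_passage)            # the model is asked to continue with the final word
    return "\n\n".join(parts)

# Zero-shot: the model sees only the unfinished passage.
zero_shot = build_prompt([], "She picked up the pen and began to")

# Few-shot: a handful of solved passages (invented here) precede the query.
few_shot = build_prompt(
    [("He poured the coffee into his", "cup"),
     ("The pianist sat down and began to", "play")],
    "She picked up the pen and began to",
)
print(few_shot)
```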
In addition to reporting summary metrics on the benchmark tasks, they also provide a qualitative analysis of the model output and observe that the model can infer basic mathematical operations from context even when the symbols are heavily obfuscated.
Of course, the model also picks up stereotypes and biases from the data, something Microsoft and Nvidia say they are also addressing.
In addition, they stated that the use of MT-NLG in production scenarios must comply with Microsoft's "Responsible AI Principles" to reduce the negative impact of the output content, but the model has not yet been made public.
Reference Links:
https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/
- End -
This article is original content from [Quantum Bit (QbitAI)], a signed account under NetEase News / NetEase's special content incentive plan. Reproduction without the account's authorization is prohibited.