

Predicting multiple tokens at once: Meta's new models run inference 3x faster and solve up to 17% more programming problems

Updated: 2024-05-03
Mengchen and Xifeng, from Aofei Temple
Qubit | Official account QbitAI

"Predicting the next token" is considered the basic paradigm of large models. What about predicting multiple tokens at once ?

A Meta AI team in France has proposed "better & faster large language models via multi-token prediction".

Multi-token prediction models perform particularly well on programming tasks.

Compared with single-token prediction, the 13B parameter model solved 12% more problems on HumanEval and 17% more on MBPP.

On small algorithmic reasoning tasks, multi-token prediction also brings impressive gains in out-of-distribution generalization.

On natural language tasks, however, multi-token prediction does not significantly improve the 7B model's performance on multiple-choice math questions.

Another benefit: even at large batch sizes, a model trained with 4-token prediction can run inference up to 3 times faster.
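Speedups of this kind are typically obtained by speculative-style decoding: the extra heads draft several future tokens in a single forward pass, and the ordinary next-token head then verifies the draft, so only the confirmed prefix is kept. A rough greedy sketch in Python, using hypothetical helpers predict_next_n (greedy guesses from the n heads) and next_token_greedy (head-1 predictions at every position), not the paper's actual code:

def speculative_step(model, tokens):
    # Draft: one forward pass; the n heads give greedy guesses for the next n tokens.
    draft = model.predict_next_n(tokens)                  # hypothetical helper
    # Verify: one forward pass over the extended sequence; keep drafted tokens
    # only as long as the ordinary next-token head agrees with them.
    verified = model.next_token_greedy(tokens + draft)    # hypothetical helper
    out = list(tokens)
    for i, tok in enumerate(draft):
        expected = verified[len(tokens) + i - 1]   # head-1 prediction for this position
        out.append(expected)                       # accept the verified token
        if expected != tok:                        # first disagreement ends the run
            break
    return out

Each step accepts between 1 and n tokens while producing exactly the same text as ordinary greedy decoding, which is where the wall-clock savings come from.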

Multi-token prediction is more suitable for programming

Specifically, the team designed a new multi-token prediction architecture to predict n future tokens in parallel through n independent output heads.
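In code, this amounts to a shared trunk whose hidden states feed n parallel heads, with the training loss summed over all of them. A minimal PyTorch-style sketch (each head is reduced to a plain linear projection here, and names such as MultiTokenPredictor are illustrative, not taken from the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Shared transformer trunk plus n independent output heads."""
    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n: int = 4):
        super().__init__()
        self.trunk = trunk   # any causal model returning hidden states (batch, seq, d_model)
        self.n = n
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n)])

    def forward(self, input_ids: torch.Tensor):
        hidden = self.trunk(input_ids)                 # (batch, seq, d_model)
        return [head(hidden) for head in self.heads]   # n sets of logits

def multi_token_loss(logits_per_head, input_ids):
    # Head i is trained to predict the token (i + 1) positions ahead; losses are summed.
    loss = 0.0
    for i, logits in enumerate(logits_per_head):
        targets = input_ids[:, i + 1:]                 # tokens (i + 1) steps ahead
        preds = logits[:, :targets.size(1)]            # drop positions with no target
        loss = loss + F.cross_entropy(
            preds.reshape(-1, preds.size(-1)), targets.reshape(-1)
        )
    return loss

At inference time the extra heads can simply be dropped, leaving an ordinary next-token model, or kept to speed up decoding as sketched above.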

The models are trained on large amounts of text, including both code and natural-language datasets.

The team then ran experiments comparing multi-token and single-token prediction across multiple downstream tasks.

Why does multi-token prediction help more on programming tasks and small algorithmic reasoning tasks?

The team speculates that there may be two reasons:

First, programming languages have a stricter logical structure and tighter internal dependencies: a single key token can determine the direction of an entire subsequent code block. Multi-token prediction captures such long-range dependencies better.

Second, programming languages have smaller vocabularies than natural languages, so predicting several tokens at a time is not that hard; instead, it pushes the model to step back from local details and optimize more globally.

In addition to token-level experiments, the team also tried more fine-grained byte-level models.

They found that after replacing next-byte prediction with 8-byte prediction, the model's pass@1 on MBPP rose by 67%, and on HumanEval by 20%.

On top of that, inference becomes about 6 times faster.

As for the principle behind it, the team believes that multi-token prediction alleviates the distribution mismatch between teacher forcing during training and autoregressive generation during inference.

In other words, during training the model only ever sees the ground-truth continuation, but at generation time it has to rely on its own outputs. It is like a student who always has the answer key while doing exercises at home but nothing during the exam: the mismatch is uncomfortable.

Multi-token prediction effectively forces the model to think a few steps ahead during training, so it can handle the "exam room" with ease.

From the perspective of information theory, the team also gave a more precise argument.

Traditional next-token prediction aims to minimize the entropy at the current position, whereas 2-token prediction minimizes the sum of the entropies at the current and the next position.

A short derivation shows that the latter implicitly gives greater weight to the mutual information between the current token and the following token, i.e., it places more emphasis on their correlation. This is why multi-token prediction is more "far-sighted".
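In rough terms (writing H for entropy, I for mutual information, and X, Y for the next two tokens; a sketch of the argument, not the paper's full derivation), two steps of ordinary next-token prediction optimize

H(X) + H(Y | X) = H(X | Y) + I(X; Y) + H(Y | X),

while 2-token prediction optimizes

H(X) + H(Y) = H(X | Y) + 2 I(X; Y) + H(Y | X),

so the mutual information between the current and the following token enters with twice the weight.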

However, there are several unresolved issues in this paper.

For example, it does not discuss how to automatically choose the best number of predicted tokens n; the authors suggest that future work could explore loss-weight adjustment or dynamically adjusting n.

In addition, the optimal vocabulary size for multi-token prediction may differ from that for single-token prediction.

In short, after reading this paper, everyone is looking forward to Llama-4 even more.

Paper address:
https://arxiv.org/abs/2404.19737

-over-
