

Predicting multiple tokens at once: Meta's new models run inference 3x faster and solve up to 17% more programming problems

Updated: 2024-05-03
Mengchen and Xifeng, from Aofei Temple
Qubit | Official account QbitAI

"Predicting the next token" is considered the basic paradigm of large models. What about predicting multiple tokens at once ?

A Meta AI team in France has proposed "better & faster large language models via multi-token prediction".

Multi-token prediction models perform particularly well on programming tasks.

Compared with single-token prediction, the 13B parameter model solved 12% more problems on HumanEval and 17% more on MBPP.

On small algorithmic reasoning tasks, multi-token prediction also brings impressive gains in out-of-distribution generalization.

On natural language tasks, however, multi-token prediction does not significantly improve the 7B model's performance on multiple-choice math questions.

Another benefit: even at large batch sizes, a model trained with 4-token prediction can run inference up to 3 times faster.
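Speedups of this kind are typically obtained by speculative-style decoding: the extra heads draft several future tokens in a single forward pass, and the ordinary next-token head then verifies the draft, so only the confirmed prefix is kept. A rough greedy sketch in Python, using hypothetical helpers predict_next_n (greedy guesses from the n heads) and next_token_greedy (head-1 predictions at every position), not the paper's actual code:

def speculative_step(model, tokens):
    # Draft: one forward pass; the n heads give greedy guesses for the next n tokens.
    draft = model.predict_next_n(tokens)                  # hypothetical helper
    # Verify: one forward pass over the extended sequence; keep drafted tokens
    # only as long as the ordinary next-token head agrees with them.
    verified = model.next_token_greedy(tokens + draft)    # hypothetical helper
    out = list(tokens)
    for i, tok in enumerate(draft):
        expected = verified[len(tokens) + i - 1]   # head-1 prediction for this position
        out.append(expected)                       # accept the verified token
        if expected != tok:                        # first disagreement ends the run
            break
    return out

Each step accepts between 1 and n tokens while producing exactly the same text as ordinary greedy decoding, which is where the wall-clock savings come from.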

Multi-token prediction is more suitable for programming

Specifically, the team designed a new multi-token prediction architecture to predict n future tokens in parallel through n independent output heads.
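In code, this amounts to a shared trunk whose hidden states feed n parallel heads, with the training loss summed over all of them. A minimal PyTorch-style sketch (each head is reduced to a plain linear projection here, and names such as MultiTokenPredictor are illustrative, not taken from the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Shared transformer trunk plus n independent output heads."""
    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n: int = 4):
        super().__init__()
        self.trunk = trunk   # any causal model returning hidden states (batch, seq, d_model)
        self.n = n
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n)])

    def forward(self, input_ids: torch.Tensor):
        hidden = self.trunk(input_ids)                 # (batch, seq, d_model)
        return [head(hidden) for head in self.heads]   # n sets of logits

def multi_token_loss(logits_per_head, input_ids):
    # Head i is trained to predict the token (i + 1) positions ahead; losses are summed.
    loss = 0.0
    for i, logits in enumerate(logits_per_head):
        targets = input_ids[:, i + 1:]                 # tokens (i + 1) steps ahead
        preds = logits[:, :targets.size(1)]            # drop positions with no target
        loss = loss + F.cross_entropy(
            preds.reshape(-1, preds.size(-1)), targets.reshape(-1)
        )
    return loss

At inference time the extra heads can simply be dropped, leaving an ordinary next-token model, or kept to speed up decoding as sketched above.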

The models are trained on large amounts of text, including both code and natural-language datasets.

The team then ran experiments comparing multi-token and single-token prediction across multiple downstream tasks.

Why does multi-token prediction help more on programming tasks and small algorithmic reasoning tasks?

The team speculates that there may be two reasons:

First, programming languages have a stricter logical structure and tighter internal dependencies: a single key token can determine the direction of an entire subsequent code block. Multi-token prediction captures such long-range dependencies better.

Second, programming languages have smaller vocabularies than natural languages, so predicting several tokens at a time is not that hard; instead, it pushes the model to step back from local details and optimize more globally.

In addition to token-level experiments, the team also tried more fine-grained byte-level models.

They found that after replacing next-byte prediction with 8-byte prediction, the model's pass@1 on MBPP rose by 67%, and on HumanEval by 20%.

On top of that, inference becomes about 6 times faster.

As for the principle behind it, the team believes that multi-token prediction alleviates the distribution mismatch between teacher forcing during training and autoregressive generation during inference.

In other words, during training the model only ever sees the ground-truth continuation, but at generation time it has to rely on its own outputs. It is like a student who always has the answer key while doing exercises at home but nothing during the exam: the mismatch is uncomfortable.

Multi-token prediction effectively forces the model to think a few steps ahead during training, so it can handle the "exam room" with ease.

From the perspective of information theory, the team also gave a more precise argument.

Traditional next-token prediction aims to minimize the entropy at the current position, whereas 2-token prediction minimizes the sum of the entropies at the current and the next position.

A short derivation shows that the latter implicitly gives greater weight to the mutual information between the current token and the following token, i.e., it places more emphasis on their correlation. This is why multi-token prediction is more "far-sighted".
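In rough terms (writing H for entropy, I for mutual information, and X, Y for the next two tokens; a sketch of the argument, not the paper's full derivation), two steps of ordinary next-token prediction optimize

H(X) + H(Y | X) = H(X | Y) + I(X; Y) + H(Y | X),

while 2-token prediction optimizes

H(X) + H(Y) = H(X | Y) + 2 I(X; Y) + H(Y | X),

so the mutual information between the current and the following token enters with twice the weight.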

However, there are several unresolved issues in this paper.

For example, it does not discuss how to automatically choose the best number of predicted tokens n; the authors suggest that future work could explore loss-weight adjustment or dynamically adjusting n.

In addition, the optimal vocabulary size for multi-token prediction may differ from that for single-token prediction.

In short, after reading this paper, everyone is looking forward to Llama-4 even more.

Paper address:
https://arxiv.org/abs/2404.19737

-over-
