Small model takes on a model with 14x its parameters: Google proposes a new test-time scaling law
West Wind, from Aofei Temple
Quantum Bit | Official Account QbitAI
Without increasing model parameters, and with the same compute budget, a small model outperforms a model 14 times its size!
Google DeepMind's latest research has sparked heated discussion, and some even speculated that this may be the method behind OpenAI's upcoming model, Strawberry.
The research team explored how to optimally allocate computation when doing inference with large models, dynamically adjusting the allocation of test-time compute based on the difficulty of a given prompt.
They found that in some cases this approach is more cost-effective than simply scaling up model parameters.
In other words, spending less compute in the pre-training phase and more at inference time may be the better strategy.
Using extra computation to improve output at inference time
The core question of this study is:
Given a fixed compute budget for solving a prompt, different computational strategies vary significantly in effectiveness across problems. How should we evaluate and choose the test-time strategy best suited to the problem at hand? And how does that strategy compare with simply using a larger pre-trained model?
The DeepMind research team explored two main mechanisms for scaling computation at test time.
One is searching against a dense, process-based verifier reward model (PRM).
A PRM scores each step as the model generates an answer. These scores guide the search algorithm, letting it dynamically adjust its strategy and avoid wasting compute by identifying incorrect or inefficient paths during generation.
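The idea can be sketched in a few lines of Python. Everything here is a toy stand-in invented for illustration: `toy_generator` and `toy_prm_score` replace a real language model and a real learned verifier, and the scoring rule is made up. Only the selection logic reflects the technique: aggregate per-step scores and keep the best-scoring path.

```python
import math

# Toy stand-ins: a "generator" proposes candidate solution paths (lists of
# reasoning steps), and a hypothetical PRM assigns each step a score in [0, 1].
# A real system would call a language model and a learned verifier here.
def toy_generator():
    return [
        ["factor the expression", "cancel terms", "answer: 4"],   # sound path
        ["guess randomly", "answer: 7"],                          # lazy path
        ["factor the expression", "sign error", "answer: -4"],    # flawed path
    ]

def toy_prm_score(step):
    # Invented scoring rule: known-bad steps get low scores, others high.
    penalties = {"guess randomly": 0.1, "sign error": 0.2}
    return penalties.get(step, 0.9)

def prm_guided_best_path(paths, score_step):
    """Aggregate per-step PRM scores (product, roughly a joint correctness
    estimate) and return the highest-scoring candidate path."""
    def path_score(path):
        return math.prod(score_step(s) for s in path)
    return max(paths, key=path_score)

best = prm_guided_best_path(toy_generator(), toy_prm_score)
print(best[-1])  # the answer line of the winning path
```

A search-based method would additionally prune low-scoring paths mid-generation instead of scoring only completed ones; this sketch keeps just the scoring-and-selection step.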
The other is adaptively updating the model's distribution over responses at test time, conditioned on the prompt.
Rather than generating a final answer in one shot, the model sequentially revises and improves its previous attempts.
The figure below compares parallel sampling with sequential revision: parallel sampling generates N answers independently, while in sequential revision each answer conditions on the previous generation and is refined step by step.
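The two regimes can be contrasted with a toy numeric "model". All names and the error model below are invented for illustration (a real system would sample from a language model); the point is only the control flow: independent draws versus each draw conditioning on the last.

```python
import random

TARGET = 42.0  # the "correct answer" in this toy setup

def toy_model(prompt, previous=None, rng=random):
    """Invented stand-in for an LLM: a fresh call returns a noisy guess;
    a call that sees its previous draft moves halfway toward the target."""
    if previous is None:
        return TARGET + rng.uniform(-10, 10)      # fresh independent guess
    return previous + 0.5 * (TARGET - previous)   # revise toward correctness

def parallel_sampling(prompt, n, rng):
    # N independent samples; a verifier would pick the best one afterwards.
    return [toy_model(prompt, rng=rng) for _ in range(n)]

def sequential_revision(prompt, n, rng):
    # Each answer depends on the previous one and is revised step by step.
    answers, prev = [], None
    for _ in range(n):
        prev = toy_model(prompt, previous=prev, rng=rng)
        answers.append(prev)
    return answers

rng = random.Random(0)
par = parallel_sampling("q", 4, rng)
seq = sequential_revision("q", 4, rng)
# In this toy setup the sequential chain steadily approaches the target,
# while parallel samples stay independently noisy.
```

Under a fixed budget of N model calls, the two spend the same compute but explore very differently: parallel sampling buys diversity, sequential revision buys refinement.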
By studying these two strategies, the team found that the effectiveness of different methods highly depends on the difficulty of the prompt.
Therefore, the team proposed a "compute-optimal" scaling strategy that adaptively allocates test-time compute according to prompt difficulty.
They classified problems into five difficulty levels and selected the best strategy for each level.
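This difficulty-binned selection can be sketched as a lookup, assuming difficulty is estimated from the base model's pass rate on each prompt. The bin thresholds and the strategy table below are made-up placeholders, not the paper's learned values; the paper chooses them from validation accuracy under a fixed budget.

```python
# Placeholder table: which test-time strategy worked best at each of five
# difficulty levels (values invented for illustration).
BEST_STRATEGY = {
    1: "sequential_revision",   # easy: refine a first guess
    2: "sequential_revision",
    3: "mixed",                 # medium: some parallel, some sequential
    4: "parallel_search",       # hard: explore diverse candidates with a PRM
    5: "parallel_search",
}

def difficulty_level(pass_rate):
    """Map an estimated per-prompt pass rate (fraction of base-model samples
    that are correct) to one of five difficulty bins, hardest = 5."""
    for level, threshold in enumerate((0.8, 0.6, 0.4, 0.2), start=1):
        if pass_rate >= threshold:
            return level
    return 5

def choose_strategy(pass_rate):
    return BEST_STRATEGY[difficulty_level(pass_rate)]

print(choose_strategy(0.9))   # an easy prompt
print(choose_strategy(0.05))  # a hard prompt
```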
As shown in the left figure below, in the revision setting, the gap between the standard best-of-N approach (generate multiple answers and pick the best) and compute-optimal scaling gradually widens, with compute-optimal scaling surpassing best-of-N while using 4x less test-time compute.
Similarly, in the PRM search setting, compute-optimal scaling shows significant early gains over best-of-N, and in some cases approaches or exceeds best-of-N performance with 4x less compute.
The right panel of the figure above compares a PaLM 2-S model using compute-optimal scaling at test time against a pre-trained model that uses no additional test-time compute but is 14x larger.
The researchers considered both models pre-trained on X tokens and serving Y tokens at inference. In the revision setting (top right), when Y ≪ X, test-time computation generally outperforms additional pre-training.
However, as the ratio of inference tokens to pre-training tokens grows, test-time computation remains preferable on easy problems, while on harder problems additional pre-training wins out. The researchers observed a similar trend in the PRM search setting.
The study also compared test-time computation against additional pre-training compute. With matched compute, additional test-time computation generally outperformed additional pre-training on easy and medium-difficulty problems.
On harder problems, additional pre-training compute was more effective.
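A rough FLOPs-accounting sketch shows what "matched compute" means here, using the standard approximations of about 6·N·D FLOPs to pre-train an N-parameter model on D tokens and about 2·N FLOPs per inference token. The concrete parameter and token counts below are invented for illustration, not the paper's.

```python
def total_flops(n_params, pretrain_tokens, inference_tokens):
    """Approximate lifetime FLOPs: ~6*N*D for pre-training plus
    ~2*N per token generated at inference."""
    train = 6 * n_params * pretrain_tokens
    infer = 2 * n_params * inference_tokens
    return train + infer

X, Y = 1e12, 1e9                  # pre-training tokens >> inference tokens
small = total_flops(1e9, X, Y)    # 1B-parameter model (numbers illustrative)
big = total_flops(14e9, X, Y)     # 14x larger model, same data

# In the Y << X regime the 14x model costs ~14x the total compute...
ratio = big / small
# ...so under the big model's budget, the small model could instead afford
# this many extra inference tokens of test-time search or revision:
headroom = (big - small) / (2 * 1e9)
print(round(ratio, 1), headroom > Y)
```

This is why the small-model-plus-test-time-compute trade is attractive when inference volume is low relative to pre-training, and why it erodes as Y grows toward X.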
Overall, the study suggests that current methods for scaling test-time computation cannot yet fully replace scaling pre-training, but they already show advantages in certain cases.
Sparked heated discussion among netizens
Once the research was posted online, it sparked heated discussion.
Some netizens even said that this explains the reasoning method of OpenAI's "Strawberry" model.
Why do you say that?
It turns out that just last night, the outlet The Information reported that OpenAI's new model Strawberry is slated for release within the next two weeks, with greatly improved reasoning ability that requires no additional prompting from users.
Strawberry does not blindly chase scaling laws; its biggest difference from other models is that it "thinks" before answering.
As a result, Strawberry takes 10-20 seconds to respond.
This netizen speculated that Strawberry may have used a method similar to that of Google DeepMind's study (doge):
If you disagree, explain with an alternative line of reasoning!
And explain they did:
This article explores best-of-n sampling and Monte Carlo Tree Search (MCTS) .
Strawberry might be a hybrid depth model with special tokens (e.g., backtracking, planning). It might be trained with human data annotators and reinforcement learning on easily verifiable domains (e.g., math/programming).
Paper link: https://arxiv.org/pdf/2408.03314
Reference links:
[1] https://x.com/deedydas/status/1833539735853449360
[2] https://x.com/rohanpaul_ai/status/1833648489898594815
-over-
QuantumBit's annual AI theme plan is now soliciting submissions!
Welcome to contribute to the special topics "1,001 AI Applications" and "365 AI Implementation Solutions",
or share with us the AI products you are looking for and the new AI trends you have discovered.