How did GPT-4o mini top the Arena? OpenAI's scoring secret is out, and it turns out Sam Altman hinted at it long ago
Mingmin, reporting from Aofei Temple
QbitAI | Official account QbitAI
How did GPT-4o mini top the large-model arena?
It turns out OpenAI knows how to game the scoring.
In the past two days, the lmsys Arena released a controversial leaderboard in which the newly released GPT-4o mini tied for first place with the full-size GPT-4o, leaving Claude 3.5 Sonnet behind.
Netizens were furious; many felt the result was simply impossible.
Even after lmsys issued a statement urging people to look beyond the overall ranking and pay more attention to the sub-category results, many remained unconvinced, with some accusing lmsys of taking money from OpenAI.
Finally, lmsys released a complete dataset of 1,000 battles involving GPT-4o mini, covering head-to-head matchups across different languages and against different models.
Anyone can now view these results.
After a careful look, the problem becomes clear: GPT-4o mini beats Claude 3.5 Sonnet thanks to three key factors:
- Fewer refusals
- More detailed answers, always willing to provide extra information
- Clearer, more concise answer formatting
This... does make some sense!
One netizen said that when a model refuses to answer in the arena, he assumes it has forfeited the match, so he is more inclined to declare the other model the winner.
And a clearer answer format will also make it easier for people to find information.
Isn't this just like teachers grading papers? Papers with neat handwriting and clean formatting, or that simply write more, always earn extra points... It turns out OpenAI has figured out human psychology.
In fact, when GPT-4o mini was first released, Sam Altman hinted at this special optimization:
You will definitely like using this new model very much.
GPT-4o mini is willing to take on more requests
Let’s first look at some typical examples of GPT-4o mini’s success:
Case 1: Claude 3.5 Sonnet refuses to answer.
Prompt:
Give me all the Korean diplomatic documents.
First, compare the two answers: Claude 3.5 Sonnet's is shorter and uses no bold or other formatting, while GPT-4o mini's is twice as long.
As for the content, Claude 3.5 Sonnet opened with an apology, saying that as an AI model it could not access such documents, then suggested some channels through which the user could find the relevant information.
It closed by reminding the user that these documents may be confidential or non-public, and to contact the relevant agencies for more.
GPT-4o mini, by contrast, never said it was powerless. It compiled an overview of Korean diplomatic documents from ancient times to the present based on public information, and pointed the user to academic journals, books, monographs, and other channels for further research.
It closed by noting that thoroughly understanding South Korea's diplomatic documents requires consulting a variety of sources, and invited the user to keep asking questions.
Case 2: Differences in detail
Prompt:
In git, is it possible to revert changes introduced by a specific commit, even if it is not the most recent commit?
Both GPT-4o mini and Claude 3.5 Sonnet answered this question correctly, but the former gave more detail and concrete examples.
Claude 3.5 Sonnet's answer was also comparatively hard to read.
Case 3: Differences in presentation format
Prompt:
Jane said to John, "John, why do you always brag so much?" He replied, "What? I have never boasted in my life. In fact, I am the most humble person in the world, maybe the most humble person who has ever lived!"
Claude 3.5 Sonnet and GPT-4o mini gave essentially the same answer, explaining the irony of the passage: John calling himself the most humble person is itself a form of bragging.
However, GPT-4o mini's answer is easier to take in at a glance, making good use of subheadings and bold text. The whole answer is divided into four parts: an initial conclusion, an analysis, why it is humorous, and a summary.
These examples not only illustrate how GPT-4o mini and Claude 3.5 Sonnet respond differently, but also reflect a characteristic of the large-model arena:
Most user questions are fairly everyday, not complicated math, reasoning, or programming problems.
This means these questions are basically within every big model's reach, and all of them can answer.
In that case, not refusing and presenting answers in a prettier format really can win over the judges.
Some say that, by comparison, Claude 3.5 Sonnet is like a smart but more rigorous person who does exactly what is asked.
GPT-4o mini is more like a people-pleaser: always doing a bit extra and more willing to take on all kinds of requests.
For example, one user noted that Claude refused to role-play for him, while ChatGPT was happy to.
Of course, this also reflects a problem:
It’s time to focus on the questions that the big model refuses to answer!
One user said he was genuinely glad to see models score low because of their strict moral boundaries: to use the models with a strong sense of ethics (Claude, Gemini, etc.), he always had to carefully craft every prompt, which was exhausting.
However, GPT-4o mini is not without shortcomings.
On math tasks, it performed much worse.
Compared with Claude, it has a worse memory and loses track of context after a while.
And a bug that Claude fixes in one go might take GPT-4o twenty attempts and an hour to fix.
But in the arena ratings, GPT-4o mini still ranks at the top.
Friends who have used both models, what is the difference between them in your opinion?
Welcome to share your experience in the comment section~
Reference links:
[1] https://www.reddit.com/r/LocalLLaMA/comments/1ed01p8/why_gpt4o_mini_beats_claude_35_sonnet_on_lmsys/
[2] https://huggingface.co/spaces/lmsys/gpt-4o-mini_battles
[3] https://x.com/lmsysorg/status/1816838034270150984
[4] https://x.com/lmsysorg/status/1815855136318840970
- End -