Gemini Pro falls short of GPT-3.5: CMU's in-depth comparative study stresses fairness, transparency, and reproducibility

Last updated: 2023-12-20

Mengchen, reporting from Aofei Temple
QbitAI | Official account QbitAI

How strong is Google's Gemini really? Carnegie Mellon University has carried out a professional, objective third-party comparison.

To ensure fairness, all models were given the same prompts and generation parameters, and the team released reproducible code and fully transparent results.

Unlike Google's official launch, the study does not pit CoT@32 results against 5-shot results.
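Why that comparison is lopsided can be shown with a small sketch. It assumes CoT@32 means sampling 32 chain-of-thought answers and aggregating them, simplified here to a plain majority vote; Google's exact aggregation rule is not described in this article, and the sampled answers below are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer

# Hypothetical samples for one question (a real CoT@32 run would have 32).
sampled = ["B", "B", "C", "B", "A"]

single_shot_answer = sampled[0]        # what a single-sample setup reports
voted_answer = majority_vote(sampled)  # what a multi-sample setup reports
```

A model scored with many samples and a vote gets several chances per question, so its number is not directly comparable to one obtained from a single few-shot answer.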

The result in one sentence: Gemini Pro is close to but slightly behind GPT-3.5 Turbo, and GPT-4 is still far ahead.

The in-depth analysis also turned up some odd Gemini behaviors, such as a tendency to pick D on multiple-choice questions.

Many researchers remarked on how intense the pace was: such a detailed evaluation arrived only a few days after Gemini's release.

In-depth testing of six major tasks

The study compared the models on six major tasks, each with corresponding datasets:

Knowledge-based QA: MMLU
Reasoning: BIG-Bench Hard
Mathematics: GSM8k, SVAMP, ASDIV, MAWPS
Code: HumanEval, ODEX
Translation: FLORES
Web navigation: WebArena

Knowledge-based QA: A preference for option D

The results show that chain-of-thought prompting does not necessarily help on this type of task.

MMLU consists entirely of multiple-choice questions. Further analysis of the results revealed a strange pattern: Gemini is biased toward answering D.

The GPT models' answers are spread much more evenly across the four options. The team suggests this may be because Gemini has not been heavily instruction-tuned on multiple-choice questions.
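A bias like this is easy to check from a list of predicted letters. The following is a minimal sketch, not the CMU team's code; the function name and the sample predictions are hypothetical.

```python
from collections import Counter

def answer_distribution(predictions, options=("A", "B", "C", "D")):
    """Return the share of predictions falling on each option letter."""
    counts = Counter(p for p in predictions if p in options)
    total = sum(counts.values())
    if total == 0:
        return {option: 0.0 for option in options}
    return {option: counts[option] / total for option in options}

# Hypothetical predictions from one model on eight questions.
preds = ["D", "A", "D", "D", "B", "D", "C", "D"]
dist = answer_distribution(preds)
# An unbiased model on a balanced test set would sit near 0.25 per option.
```

Comparing this distribution with the distribution of the gold answers shows whether a model leans toward one letter regardless of the question.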

In addition, Gemini's safety filtering is quite aggressive. It answered only 85% of the ethics-related questions and only 28% of the questions related to human sexuality.

The two subjects where Gemini Pro beat GPT-3.5 were security studies and high school microeconomics, but the margins were small, and the team said its analysis found nothing notable about them.

Reasoning: Not good at long questions

Gemini Pro performs poorly on longer, more complex questions, while the GPT models are more robust to length.

This is especially true for GPT-4 Turbo, which shows almost no performance degradation even on longer problems, demonstrating its strong ability to understand complex problems.
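This kind of length analysis amounts to bucketing questions by length and computing accuracy per bucket. The sketch below assumes length is measured in characters with a bucket width of 100; both choices, the function name, and the sample records are illustrative rather than taken from the study.

```python
from collections import defaultdict

def accuracy_by_length(records, bucket_size=100):
    """records: iterable of (question_text, is_correct) pairs.

    Returns {bucket_start: accuracy}, with buckets in ascending order.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question, is_correct in records:
        bucket = (len(question) // bucket_size) * bucket_size
        totals[bucket] += 1
        correct[bucket] += int(is_correct)
    return {b: correct[b] / totals[b] for b in sorted(totals)}

# Hypothetical results: accuracy falls as questions get longer.
records = [
    ("x" * 50, True),
    ("x" * 80, True),
    ("x" * 150, True),
    ("x" * 180, False),
    ("x" * 320, False),
]
by_length = accuracy_by_length(records)
```

A model that is robust to length shows a roughly flat curve across buckets, while one that struggles shows accuracy dropping in the higher buckets.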

Broken down by problem type, Gemini is particularly weak on tasks such as "tracking_shuffled_objects", in which people swap items and the model must work out who ends up holding what.

The tasks where Gemini does better are sports understanding, which requires world knowledge, along with manipulating symbol stacks, sorting words alphabetically, and parsing tables.

Mathematics: Pulling ahead on complex tasks

Here, when the questions themselves get very long, Gemini Pro and GPT-3.5 both decline, and only GPT-4 holds its usual level.

But on the problems with the longest chains of thought, Gemini overtakes GPT-3.5.

Code: Good at matplotlib

For coding questions, Gemini performs poorly on questions with long reference answers.

Broken down by the library being called, the GPT models are stronger in most categories, with matplotlib the exception.

Translation: High quality whenever it answers

On the translation task, Gemini declines to answer for 12 languages, but the translations it does produce are of high quality, and its overall performance exceeds GPT-4.

The cases Gemini refuses to translate mainly involve Latin and Arabic.

Web navigation: Better at cross-site tasks

WebArena simulates an internet environment for AI agents, including e-commerce, social forums, GitLab collaborative development, content management systems, and online maps. The agent must find information or complete tasks across sites.

Gemini performs worse overall than GPT-3.5 Turbo, but performs slightly better on tasks across multiple sites.

Netizens: But it's free

Finally, CMU associate professor Graham Neubig acknowledged some limitations of the study.

The behavior of API-based models may change at any time.
Only a limited number of prompts were tried, and the best prompt may differ from model to model.
There is no way to control for test set leakage.

Denny Zhou (Zhou Dengyong), head of Google's large-model reasoning team, pointed out that setting Gemini's temperature to 0 can add 5 to 10 percentage points on reasoning tasks.

In addition to the Gemini and GPT series, this test also used Mixtral, an open source MoE model that has attracted much attention recently.

However, reinforcement learning expert Noam Brown believes the Mixtral results can be disregarded, because the test used a third-party API rather than the official implementation.

The founder of Mistral AI also stepped in to offer the team access to the official version, believing it would yield better results.

In summary, although Gemini Pro is still not as good as GPT-3.5, it has one advantage over GPT-3.5: it is free for up to 60 calls per minute.

As a result, quite a few individual developers have already switched camps.
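For developers who do switch, staying under a 60-calls-per-minute limit usually means throttling on the client side. The class below is a minimal sliding-window sketch under that assumption; `SlidingWindowLimiter` and `try_acquire` are hypothetical names, not part of any Gemini SDK, and the service's own enforcement may work differently.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most max_calls within any window_seconds-long window."""

    def __init__(self, max_calls=60, window_seconds=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock    # injectable so the limiter can be tested
        self.calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self):
        """Record a call and return True if allowed, else return False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False
```

A caller would check `try_acquire()` before each request and wait briefly whenever it returns False.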

At present, Ultra, the top Gemini tier, has not yet been released, and the CMU team intends to extend this research once it is.

Do you think Gemini Ultra can reach GPT-4 level?

Paper:
https://arxiv.org/abs/2312.11444

Reference link:
[1] https://twitter.com/gneubig/status/1737108977954251216
