
Google Gemini is back! Multimodal capabilities on par with GPT-4V | 128-page comprehensive evaluation report from CUHK and others

Latest update: 2023-12-22
Qubit | WeChat official account QbitAI

Google is back!

Less than a week after Gemini opened its API, the Chinese University of Hong Kong (CUHK) and other institutions completed an evaluation and jointly released a 128-page report. The results show:

On 37 visual understanding tasks, Gemini-Pro showed capabilities comparable to GPT-4V.

On MME, a benchmark dedicated to multimodal large language models, Gemini-Pro's combined perception and cognition score reached a high of 1933.4, surpassing GPT-4V (1926.6).

Previously, a CMU evaluation had found that Gemini-Pro's overall capabilities were roughly on par with GPT-3.5.

Now, on multimodality, its major selling point, Gemini-Pro can be said to have made a comeback.

So how does it actually perform?

The evaluation report has a total of 128 pages. Let’s focus on the key points.

Gemini-Pro’s first multimodal capabilities report is here

The evaluation focuses on Gemini-Pro's visual understanding ability.

It covers four major areas: basic perception, advanced cognition, challenging visual tasks, and various expert abilities, with qualitative comparisons on 37 subdivided tasks.

The quantitative evaluation is carried out on MME, an evaluation benchmark designed specifically for multimodal large language models.
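For a sense of how such a query reaches the model, here is a minimal sketch of a single visual question sent to Gemini-Pro, assuming the google-generativeai Python SDK and the gemini-pro-vision model name; the report's actual evaluation harness may differ.

# Minimal sketch of one visual-question query to Gemini-Pro.
# Assumes the google-generativeai SDK and the "gemini-pro-vision" model name;
# the API key and image path are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")

image = Image.open("sample.jpg")
question = "Is there a dog in this image? Please answer yes or no."

response = model.generate_content([question, image])
print(response.text)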

Let’s first look at the quantitative test results.

Overall performance on MME is better than GPT-4V's

The MME benchmark contains two broad categories of tasks.

The first is perception, covering judgments of object existence, object counting, positional relationships, color, OCR, and recognition of posters, celebrities, scenes, landmarks, and artwork.

The second is cognition, covering commonsense reasoning, numerical calculation, text translation, and code reasoning.
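For reference, MME scores each subtask out of 200 by adding two accuracies: per-question accuracy and a stricter per-image accuracy that counts an image only if both of its yes/no questions are answered correctly; category totals are sums over subtasks. This scoring rule comes from the MME benchmark itself rather than from this report, so treat the sketch below as an illustrative assumption.

# Sketch of the MME per-subtask scoring rule:
# score = accuracy (per question) + accuracy+ (per image, both questions correct),
# each expressed as a percentage, so the maximum per subtask is 200.
def mme_subtask_score(results):
    """results: one (q1_correct, q2_correct) boolean pair per test image."""
    n_images = len(results)
    correct_questions = sum(int(a) + int(b) for a, b in results)
    correct_images = sum(1 for a, b in results if a and b)
    acc = 100.0 * correct_questions / (2 * n_images)
    acc_plus = 100.0 * correct_images / n_images
    return acc + acc_plus

# Example: 30 images, 25 fully correct, 3 half correct, 2 wrong -> about 171.7
demo = [(True, True)] * 25 + [(True, False)] * 3 + [(False, False)] * 2
print(round(mme_subtask_score(demo), 1))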

The result is as follows:

It can be seen that Gemini-Pro and GPT-4V have their own strengths.

Gemini-Pro's total score is 1933.4, slightly higher than GPT-4V's 1926.6.

Specifically:

1. Gemini-Pro performs outstandingly in tasks such as text translation, color/landmark/person recognition, and OCR;

2. GPT-4V scored 0 on the celebrity recognition task, mainly because it refused to answer questions related to celebrities;

3. Neither Gemini nor GPT-4V performs well on the positional-relationship task, indicating that both are insensitive to spatial position information;

4. The open-source model SPHINX is on par with or even better than GPT-4V and Gemini on perception tasks, but it lags far behind them on cognition tasks.

The following are the qualitative results on the four major tasks.

Basic perception

Perceptual capability directly affects a model's performance on higher-order tasks, because it determines how accurately and effectively the model acquires and processes raw visual input.

The report tests the models' object-level, scene-level, and knowledge-based perception capabilities.

Specifically, there are 10 subdivided tasks:

Due to limited space, we only show 5 of them here:

1. Spatial relationship

None of the models reliably distinguishes left from right. However, GPT-4V can learn this task from a few in-context examples and then answer correctly.

2. Object counting

Simple cases are generally handled well, but the more complicated ones defeat all the models. When counting NBA basketball players, however, Gemini-Pro's answer comes quite close (the correct answer being 42).

3. Optical illusion

In the example on the left, the two pears actually have the same brightness. Gemini Pro recognized this correctly, while GPT-4V and SPHINX were fooled.

4. Scene understanding

The models are able to depict key visual elements in the scene. In comparison, GPT-4V shows superior performance, with more detailed descriptions and fewer instances of hallucinations.

5. Video scene understanding

By extracting key frames at three moments from the video, Gemini Pro can integrate information from different frames into a coherent scene description.

GPT-4V only describes the content of the images frame by frame, while SPHINX's description does not demonstrate a comprehensive understanding of the image sequence.
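The models are given sampled key frames rather than raw video. A minimal way to obtain a few frames and pass them together in one query is sketched below, assuming OpenCV for frame extraction and the same SDK as above; the report's actual frame-sampling procedure is not specified here.

# Sketch: sample three frames from a video with OpenCV and send them together
# in one multimodal query. Frame indices, file names, and SDK usage are
# illustrative assumptions, not the report's actual procedure.
import cv2
import google.generativeai as genai
from PIL import Image

def sample_frames(path, num_frames=3):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-pro-vision")
frames = sample_frames("clip.mp4")
prompt = "These are three frames from one video. Describe the scene as a whole."
print(model.generate_content([prompt, *frames]).text)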

Advanced cognition

Such tasks require deep reasoning, problem solving, and decision-making by the model.

Here, the report tests reasoning over text-rich visual content, abstract visual reasoning, science problem solving, emotion analysis, and intelligence games. This covers 13 subdivided tasks; due to space limitations we show only a few of them.

1. Code generation

Converting structured visual content into corresponding code is an important skill for multimodal large models. Here, the models' ability to recognize formulas and generate LaTeX code, and to recognize web pages and generate HTML code, was tested.

Gemini Pro and GPT-4V show better results in formula recognition, but still misrecognize some small characters or symbols.

All three models still have a lot of room for improvement in recognizing web pages and generating the corresponding HTML code.
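As a rough illustration of how this kind of probe can be run, one can simply ask the model to transcribe a formula image into LaTeX and inspect the result. The snippet below reuses the same assumed SDK and model name as the earlier sketch; the exact prompts used in the report are not reproduced.

# Sketch of a formula-to-LaTeX probe, reusing the assumed google-generativeai SDK.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-pro-vision")

formula_img = Image.open("equation.png")  # screenshot of a printed formula
prompt = ("Transcribe the mathematical formula in this image into LaTeX. "
          "Return only the LaTeX source, with no explanation.")
print(model.generate_content([prompt, formula_img]).text)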

2. Abstract visual stimuli

Understanding and reasoning about abstract visual stimuli and symbols is a fundamental ability of human intelligence. GPT-4V demonstrates the best abstraction performance, providing a detailed description of how objects are composed of shapes. Gemini Pro can recognize some simple abstract patterns.

3. Image sentiment analysis

All the models do a good job of describing the scenes and the emotions they might evoke. GPT-4V's observations remain neutral, emphasizing that emotions are subjective while giving a more comprehensive analysis, whereas Gemini Pro tends to output an emotional preference directly.

4. Emotion-conditioned output

Emotion-conditioned output asks a multimodal large model to describe the visual context conditioned on a predefined emotion.

Although Gemini Pro and GPT-4V are able to correctly inject the corresponding emotions into the generated text, both suffer from hallucination.

5. Sudoku game

With only an image as input, Gemini Pro fails to correctly identify the blank positions, although it attempts to provide answers in the output matrix, while GPT-4V and SPHINX fail at even the first step of optical character recognition. Given the corresponding text input, however, both Gemini Pro and GPT-4V can give the correct answer.

Challenging visual tasks

This part evaluates the performance of multimodal large models on a variety of challenging vision tasks that go beyond standard visual question answering.

These tasks demand deep visual perception and understanding, and evaluating them offers insight into how feasible the models are for applications across many fields.

The report tests performance on image-level vision tasks and temporal vision tasks, covering the following 7 subdivided tasks:

Here we show 3.

1. Referring expression comprehension

Both Gemini Pro and GPT-4V are able to identify the approximate location of a referred object, but they struggle to provide precise coordinates and box sizes. SPHINX demonstrates the ability to provide the exact location and size of referenced objects.
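A common way to make "approximate location" versus "exact location and size" precise is intersection-over-union (IoU) between the predicted and ground-truth boxes, with a prediction usually counted as correct above a 0.5 threshold. The report does not spell out its scoring, so the sketch below is only an illustration of that standard metric.

# Sketch of intersection-over-union (IoU) for judging predicted bounding boxes.
# Boxes are (x1, y1, x2, y2); the 0.5 threshold is a conventional choice,
# not taken from the report.
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (48, 30, 210, 190), (50, 40, 200, 180)
print(f"IoU = {iou(pred, gt):.2f}")  # about 0.81, correct at the 0.5 threshold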

2. Target tracking

Both Gemini Pro and GPT-4V are able to depict the target being tracked in detail, but they provide incorrect bounding boxes in the next two frames.

3. Visual story generation

The task requires the model to fully understand the information in the image and organize it logically in the generated story.

Gemini Pro and SPHINX provide a coherent story, but they do not closely follow the comic plot.

GPT-4V provided precise descriptions for each illustration but failed to weave them into a cohesive story as the task required.

Various expert abilities

Expert ability measures how well a multimodal large model generalizes its learned knowledge and skills to different areas of expertise. Beyond the perceptual and cognitive tasks above, the robustness of multimodal large models in specialized and unusual scenarios is often of more practical relevance. There are again 7 subdivided tasks:

Here we also show 3:

1. Defect detection

Defect detection requires high precision and attention to detail. For images with obvious defects, the model can provide correct answers, with GPT-4V outputting more detailed reasons and descriptions.

In the example of thread damage in the picture below, Gemini Pro gave an overly general answer, SPHINX incorrectly described the appearance, and GPT-4V gave the correct answer.

2. Economic analysis

The report shows two stock price charts with questions to answer. Gemini Pro demonstrates expert financial knowledge and gives the right answers. GPT-4V declines to give a clear answer, citing risk concerns. SPHINX cannot understand such questions due to a lack of relevant training data.

3. Robot motion planning

Robot planning requires the robot to be able to determine how to act in a given situation to achieve a specific goal.

Both Gemini Pro and GPT-4V can provide methodical and detailed steps, and GPT-4V seems to make more reasonable decisions than Gemini Pro, such as the order of battery installation, but SPHINX cannot complete the assembly of the phone, indicating its limited generalization ability.

Summary evaluation: roughly evenly matched

Given its excellent multimodal reasoning capabilities, Gemini is indeed a strong challenger to GPT-4V.

In most cases, Gemini's answer accuracy is competitive with GPT-4V's, while showing different answering styles and preferences.

GPT-4V tends to generate more detailed descriptions on perceptual tasks and to provide in-depth analysis with step-by-step intermediate reasoning on cognitive tasks, while Gemini prefers to give direct, concise answers, which helps users quickly find relevant information.

However, the two models share some common problems: weak spatial perception, unsatisfactory performance on complex OCR and abstract visual understanding, occasionally inconsistent reasoning results, and insufficient robustness to prompt design. In many cases they still struggle.

Therefore, judging from the results at this stage, the two are roughly evenly matched.

The author's final conclusion is:

Generally speaking, the multi-modal capabilities of large models still have a long way to go.

In which directions, specifically?

Three aspects: visual representation encoding (fine-grained appearance, spatial-relationship perception), multimodal alignment (hallucination reduction, OCR accuracy), and model reasoning capabilities (quantitative processing, logical consistency).

For more evaluation comparisons between Gemini Pro, GPT-4V, and SPHINX, please see the original paper.

Links:
[1]https://arxiv.org/pdf/2312.12436.pdf
[2]https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

- End -


