Gemini's video reasoning is far ahead of GPT-4o, Jeff Dean shared it three times in a row, and the first video multimodal benchmark, Video-MME, is here
Mengchen sent from Aofei Temple
Quantum Bit | Public Account QbitAI
OpenAI and Google held two consecutive press conferences, bringing AI video reasoning to new heights.
However, the industry still lacks a benchmark that can comprehensively evaluate the video reasoning capabilities of large models.
Now Video-MME, a comprehensive evaluation benchmark for video analysis with multimodal large models, fills that gap by systematically assessing their video understanding capabilities.
Gemini 1.5 Pro leads the list by a wide margin, underscoring its dominant position in video understanding. Upon release, Video-MME was shared three times in a row by Jeff Dean, Google's Chief Scientist.
The video reasoning capabilities touted for GPT-4o and Google's Gemini 1.5 Pro have now been put to the test for the first time on a new, more challenging multimodal benchmark: Video-MME.
Meanwhile, models from major companies and research institutions such as NVIDIA and ByteDance have also joined the fray.
Video-MME was jointly launched by USTC, Xiamen University, the Chinese University of Hong Kong, and other universities, and its code and dataset have been open-sourced.
Fully manually annotated high-quality dataset
Unlike existing datasets, this benchmark is fully manually annotated. In one example from the paper, answering the question correctly requires drawing on visual, subtitle, and audio information at the same time, with the relevant evidence spanning a 30-minute interval.
Video-MME has the following notable features:
Breadth in the time dimension: video lengths range from 11 seconds to 1 hour, covering short (<2 minutes), medium (4-15 minutes), and long (30-60 minutes) videos, comprehensively evaluating the model's contextual multimodal understanding across different time spans;
Richness of data modalities: in addition to video frames, Video-MME integrates subtitle and audio inputs to comprehensively evaluate the multimodal processing capabilities of large models;
Diversity of video types: six major domains, including knowledge, film and television, sports, art, life records, and multilingual content, spanning 30 fine-grained subdomains;
High annotation quality: 900 videos totaling 254 hours of content were manually annotated and verified by experts with a large-model background, yielding 2,700 question-answer pairs covering 12 question types, including perception, cognition, and summarization;
Reliable certificate length (the shortest video segment needed to answer a question correctly): the median certificate lengths for short, medium, and long videos in Video-MME are 26.0 seconds, 164.7 seconds, and 890.7 seconds respectively, requiring models to digest substantially more video content to answer the questions;
Comprehensive experimental evaluation: the paper analyzes 6 representative open-source video language models alongside the closed-source Gemini 1.5 Pro and GPT-4V/4o, and also evaluates image-based multimodal large models (generalized to multi-image input), showing that the benchmark applies to both image and video multimodal large models (a minimal scoring sketch follows this list).
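To make the evaluation protocol concrete, here is a minimal scoring sketch that computes accuracy overall and per duration bucket, assuming each model prediction is stored as a JSON record with the video's duration category, the ground-truth option, and the predicted option. This is not the official Video-MME evaluation script; the field names and file name are illustrative assumptions.

# Minimal sketch (not the official Video-MME script) for scoring
# multiple-choice predictions overall and per duration bucket.
# The JSON schema below ("duration", "answer", "prediction") is assumed.
import json
from collections import defaultdict

def score(results_path: str) -> None:
    # Expects a JSON list like:
    # [{"duration": "short", "answer": "C", "prediction": "C"}, ...]
    with open(results_path, encoding="utf-8") as f:
        records = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        bucket = r["duration"]  # "short" / "medium" / "long"
        total[bucket] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[bucket] += 1

    for bucket in ("short", "medium", "long"):
        if total[bucket]:
            acc = 100.0 * correct[bucket] / total[bucket]
            print(f"{bucket:>7}: {acc:.1f}% ({correct[bucket]}/{total[bucket]})")
    overall = 100.0 * sum(correct.values()) / max(sum(total.values()), 1)
    print(f"overall: {overall:.1f}%")

if __name__ == "__main__":
    score("video_mme_predictions.json")  # hypothetical results file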
The evaluated open-source video multimodal models include ST-LLM, VideoChat2-Mistral, Chat-UniVi-V1.5, LLaVA-NeXT-Video, and VILA-1.5, alongside the closed-source Gemini 1.5 Pro and GPT-4V/4o. The image-based multimodal models include Qwen-VL-Chat, Qwen-VL-Max, and InternVL-Chat-V1.5.
Among the commercial models, Gemini 1.5 Pro excels at video understanding, leading with 81.3% accuracy when assisted by subtitles and outperforming GPT-4V and GPT-4o by 18% and 4.1%, respectively.
Although its performance drops slightly as video length increases, its accuracy on long videos (with subtitles) still exceeds that of all open-source models on short videos.
Gemini 1.5 Pro also accepts audio input, covering a wider range of modalities. Among the open-source models, NVIDIA's VILA-1.5 performed best with an accuracy of 59.4%, but it still lags well behind Gemini 1.5 Pro on counting, action recognition, and temporal perception.
As video length increases, the performance of all models shows a clear downward trend, indicating substantial room for improvement on longer contexts and more complex tasks. The experiments also reveal that subtitle and audio information can significantly enhance video comprehension, especially for long videos.
Gemini 1.5 Pro's performance varies across the 30 video subdomains. Some tasks depend more heavily on subtitles and audio, for example long basketball videos, where adding subtitles and audio markedly improves performance. See the original paper for detailed experimental results.
Overall, the results show that current multimodal large models still have a long way to go in video understanding, especially long-video understanding. On the one hand, models need stronger multimodal long-context understanding; Gemini 1.5 Pro's support for a context window of up to one million tokens underpins its excellent performance. On the other hand, there is an urgent need to build corresponding high-quality long-video understanding datasets, an area that remains blank at present.
Paper link: https://arxiv.org/pdf/2405.21075
Project homepage: https://video-mme.github.io
Project repository: https://github.com/BradyFU/Video-MME