Evaluation results for 140+ large models from China and abroad on 80,000+ test questions are out! Produced by the Zhiyuan evaluation system

Latest update time:2024-05-17
By Yunzhong, reporting from Aofeisi
Qubit | Public account QbitAI

On May 17, 2024, Zhiyuan Research Institute (Beijing Academy of Artificial Intelligence, BAAI) held a large model evaluation conference. It officially launched the scientific, authoritative, fair, and open Zhiyuan evaluation system, and it released and interpreted the comprehensive capability evaluation results for more than 140 open-source and commercial closed-source large language models and multimodal large models from China and abroad.

This Zhiyuan evaluation examines seven major capabilities of language models along subjective and objective dimensions: simple understanding, knowledge application, reasoning ability, mathematical ability, coding ability, task solving, and safety and values. For multimodal models, it mainly assesses multimodal understanding and generation.

In the Chinese context, the comprehensive performance of leading domestic language models is close to the world's first-class level, but their capabilities are unevenly developed. On the image-text question-answering task in multimodal understanding, open-source and closed-source models are evenly matched, and domestic models perform outstandingly. The text-to-image capabilities of domestic multimodal models in the Chinese context trail the international first-class level by a small margin. For text-to-video, judging by the length and quality of the demonstration videos published by each company, Sora has a clear advantage. Among the other text-to-video models that are open for evaluation, the domestic model PixVerse performs well.

Safety and value alignment are key to deploying models in industry, but overseas and domestic models differ on this dimension, so the overall subjective and objective rankings of language models do not include the score for this dimension. The subjective evaluation results show that in the Chinese context, ByteDance Doubao Skylark2 and OpenAI GPT-4 ranked first and second, and domestic large models understand Chinese users better. In the objective evaluation, OpenAI GPT-4 and Baichuan Intelligence Baichuan3 ranked first and second. Baidu Wenxin Yiyan 4.0, Zhipu Huazhang GLM-4 and Moonshot AI Kimi all entered the top five in both the subjective and objective evaluations.
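To illustrate how one dimension can be left out of an overall ranking, here is a minimal sketch. The model names, the scores, and the unweighted mean are all hypothetical; the article does not say how Zhiyuan weights or aggregates capability scores.

```python
from statistics import mean

# Hypothetical per-capability scores (0-100); these are NOT real evaluation results.
scores = {
    "model_a": {"reasoning": 80.0, "coding": 70.0, "safety_and_values": 40.0},
    "model_b": {"reasoning": 75.0, "coding": 72.0, "safety_and_values": 95.0},
}


def overall_ranking(scores, excluded=("safety_and_values",)):
    """Rank models by the unweighted mean of every capability not in `excluded`."""
    totals = {
        model: mean(v for cap, v in caps.items() if cap not in excluded)
        for model, caps in scores.items()
    }
    # Highest mean first.
    return sorted(totals, key=totals.get, reverse=True)


ranking_without_safety = overall_ranking(scores)
ranking_with_safety = overall_ranking(scores, excluded=())
```

With these made-up numbers the two rankings come out in opposite orders, which shows why a report needs to state which dimensions feed the overall score.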

The objective evaluation results for multimodal understanding models show that on image-text question answering, Alibaba Tongyi Qwen-vl-max and Shanghai Artificial Intelligence Laboratory InternVL-Chat-V1.5 are ahead of OpenAI GPT-4, followed by LLaVA-Next-Yi-34B and Shanghai Artificial Intelligence Laboratory Intern-XComposer2-VL-7B.

The text-to-image evaluation results for multimodal generation models show that OpenAI DALL-E3 ranks first, CogView3 and Meta-Imagine rank second and third respectively, followed by Baidu Wenxin Yige and ByteDance doubao-Image. The text-to-video evaluation results show that OpenAI Sora, Runway, Aishi Technology PixVerse, Pika, and Tencent VideoCrafter-V2 rank in the top five.

Caption: For text-to-image models, the objective evaluation indicators differ greatly from subjective impressions and show signs of breaking down, so the ranking is based on subjective evaluation. Midjourney essentially cannot understand Chinese prompts, so it ranks low. For Sora, only its officially published prompts and video clips were used for comparison with videos generated by other models, so there is some deviation in the evaluation results.

First large-scale K12 subject test conducted jointly with an authoritative educational institution

At present, large models are developing toward greater generality, their logical reasoning ability has improved significantly, and they increasingly approach characteristics of the human brain. Therefore, with the support of the Haidian District Education Committee, Zhiyuan Research Institute worked with the Haidian District Teacher Training School to align the test with the way students are examined, in order to examine the subject-level differences between large models and human students. Subjective questions with non-unique answers were graded by Haidian teachers themselves.

The Zhiyuan evaluation found that models still lag behind the average level of Haidian students in comprehensive subject ability. Models are generally strong in liberal arts but weak in science, and their ability to understand charts and graphs is insufficient. Large models still have considerable room for improvement.

Interpreting the K12 subject test results, Yao Shoumei, principal of the Haidian District Teacher Training School in Beijing, pointed out that in humanities subjects such as Chinese and history, the models lacked an understanding of the cultural connotations behind the words and of attachment to home and country. When faced with comprehensive questions spanning history and geography, the models cannot identify which subject a question belongs to as effectively as human candidates. The models do better on complex English questions than on simple ones. When solving science questions, the models may use methods that go beyond the knowledge expected at that grade level. When faced with test questions they cannot understand, the models still show obvious "hallucinations".

Systematically constructing a subjective evaluation system for text-to-video models

Professor Shi Ping, head of the Intelligent Media Computing Laboratory of Communication University of China, said that compared with text, the subjective evaluation of video is extremely complex. Automated indicators cannot fully capture the quality of what a model generates, let alone quantify the authenticity of a generated video or the semantic consistency between the video and its text prompt. Therefore, it is necessary to systematically construct a subjective evaluation system for text-to-video models.

This evaluation system was jointly established by Zhiyuan Research Institute and Communication University of China, based on their research results and practical experience in large model evaluation and video quality evaluation. It scores each video on four dimensions: image-text consistency, authenticity, video quality, and aesthetic quality, providing a reference for the application and development of AIGC video generation technology.
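A minimal sketch of multi-dimensional subjective scoring follows. The four dimension names come from the article; the 1-5 scale, the sample ratings, and the per-dimension mean across raters are assumptions made for illustration, not the published scoring procedure.

```python
from statistics import mean

# The four dimensions named in the article.
DIMENSIONS = (
    "image_text_consistency",
    "authenticity",
    "video_quality",
    "aesthetic_quality",
)

# Hypothetical ratings for one generated video, one dict per rater (1-5 scale assumed).
ratings = [
    {"image_text_consistency": 4, "authenticity": 3, "video_quality": 4, "aesthetic_quality": 5},
    {"image_text_consistency": 5, "authenticity": 3, "video_quality": 3, "aesthetic_quality": 4},
]


def dimension_means(ratings):
    """Average each dimension across raters, keeping the dimensions separate."""
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}


per_dimension = dimension_means(ratings)
```

Keeping the four dimensions separate, rather than collapsing them into one number, makes it visible when a model produces attractive video that does not match the prompt.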

A scientific, authoritative, fair and open Zhiyuan evaluation system

Relying on the Ministry of Science and Technology's "Artificial Intelligence Basic Model Support Platform and Evaluation Technology" and the Ministry of Industry and Information Technology's "Large Model Public Service Platform" projects, Zhiyuan Research Institute has jointly carried out the research and development of large model evaluation methods and tools with more than 10 universities and institutions.

In June 2023, the FlagEval large model evaluation platform, jointly built by Zhiyuan Research Institute and teams from multiple universities, was launched. To date, it has completed more than 1,000 evaluations of open-source large models from around the world and continues to publish evaluation results, building up a broad base of internationally leading evaluation technology.

Zhiyuan Research Institute took the lead in establishing the IEEE large model evaluation standards group P3419, organizing more than 20 companies and scholars to participate in developing large model standards. It is also a co-drafter of the national standard draft "Artificial Intelligence Pre-training Model Evaluation Indicators and Methods". This model evaluation draws on that standard, combining unified rules for objective evaluation with multiple rounds of verification and scoring for subjective evaluation. Open-source models were run with the inference code and operating environment recommended by the model publisher. All models were given the same industry-standard prompts, and the prompts were not optimized for any individual model.

This Zhiyuan evaluation uses more than 20 data sets and more than 80,000 test questions. These include several evaluation data sets built by Zhiyuan itself or jointly with partner organizations: the Chinese multimodal multi-question-type comprehension and reasoning evaluation data set CMMU, the Chinese semantic evaluation data set C-SEM, the Chinese language and cognition subjective evaluation set CLCC, the evaluation set TACO for complex algorithmic code generation tasks, the text-to-image subjective evaluation set Image-gen, the multilingual text-to-image quality evaluation data set MG18, and the text-to-video subjective evaluation set CUC T2V prompts. More than 4,000 of the questions are subjective, all drawn from a self-built, original and undisclosed subjective evaluation set that is iterated frequently. The scoring standards are strictly calibrated, and a management mechanism combining independent, anonymous scoring by multiple raters with strict quality inspection and random spot checks is used to reduce the impact of subjective bias. In addition, to evaluate each capability of a language model more accurately, Zhiyuan mapped the sub-datasets of all objective data sets to capability labels.
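The capability label mapping mentioned above can be pictured with a small sketch. The sub-dataset names, the counts, and the choice of pooling correct answers across sub-datasets that share a label are hypothetical; the article does not describe the actual mapping or aggregation.

```python
from collections import defaultdict

# Hypothetical mapping from sub-dataset name to capability label.
CAPABILITY_LABELS = {
    "subset_arithmetic": "mathematical ability",
    "subset_geometry": "mathematical ability",
    "subset_logic": "reasoning ability",
}

# Hypothetical (correct, total) question counts per sub-dataset for one model.
results = {
    "subset_arithmetic": (30, 40),
    "subset_geometry": (30, 60),
    "subset_logic": (45, 50),
}


def accuracy_by_capability(results, labels):
    """Pool correct/total counts over all sub-datasets that share a capability label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subset, (n_correct, n_total) in results.items():
        capability = labels[subset]
        correct[capability] += n_correct
        total[capability] += n_total
    return {cap: correct[cap] / total[cap] for cap in total}


by_capability = accuracy_by_capability(results, CAPABILITY_LABELS)
```

Pooling counts weights each question equally; averaging per-sub-dataset accuracies instead would weight each sub-dataset equally, and the two choices can give different capability scores.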

Being scientific, authoritative, fair and open are the highest principles of the Zhiyuan evaluation. Wang Zhongyuan, president of Zhiyuan Research Institute, said that Zhiyuan will work with ecosystem partners to keep building and improving the evaluation system, promote the optimization of model performance and industrial deployment in diverse and complex scenarios, and promote the orderly development of large model applications.

*This article was published with permission from Qubit, and the views are solely those of the author.

