Peking University's AI Mathematical Olympiad benchmark: o1-mini scores higher than o1-preview
Contributed by the Omni-MATH team
Quantum Bit | Public Account QbitAI
With the release of OpenAI's o1 series, traditional math evaluation benchmarks are no longer enough.
On MATH-500, the full version of o1 scored 94.8 straight away.
On the harder AIME 2024 (American Invitational Mathematics Examination), o1 also reached an accuracy of 83.3%.
As existing math test sets are gradually conquered, a natural question arises: can large models handle even tougher math competitions, all the way up to Olympiad level?
To answer this, research teams from Peking University and Alibaba jointly built an Olympiad-level evaluation benchmark dedicated to mathematics competitions: Omni-MATH.
Omni-MATH is designed specifically to evaluate the mathematical reasoning ability of large language models at the Olympiad level. The evaluation set contains 4,428 competition-level problems, carefully categorized into more than 33 subfields and 10 difficulty levels, allowing a detailed analysis of model performance across mathematical disciplines and levels of complexity.
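For readers who want to poke at the data directly, the sketch below loads Omni-MATH from the Hugging Face Hub and prints a few basic statistics. The field names ("difficulty", "domain") and the split name are assumptions based on the description above; the dataset card linked at the end of the article has the authoritative schema.

```python
# A minimal sketch for inspecting Omni-MATH with the `datasets` library.
# Field names such as "difficulty" and "domain" are assumptions here;
# check https://huggingface.co/datasets/KbsdJames/Omni-MATH for the real schema.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("KbsdJames/Omni-MATH", split="test")  # split name may differ
print(f"number of problems: {len(ds)}")  # expected: 4428

print(ds[0].keys())  # inspect the actual fields of one record

# Rough distribution over the (assumed) difficulty labels.
difficulty_counts = Counter(str(row.get("difficulty")) for row in ds)
for level, count in sorted(difficulty_counts.items()):
    print(level, count)
```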
The latest leaderboard shows a very tight race:
Setting aside the full version of o1, whose API has not yet been released, o1-mini performs best despite being a small model, with an average score about 8% higher than o1-preview's.
The best open-source model is Qwen2-MATH-72b, which even outperforms GPT-4o.
Overall, this once again validates the advantage of o1-mini's approach: concentrating on a few core capabilities instead of trying to store extensive world knowledge.
Omni-MATH: Difficult and Broad in Coverage
As a Mathematical Olympiad evaluation benchmark, Omni-MATH has three characteristics:
Manually verified answer reliability: the 4,428 evaluation problems come from a range of mathematics competitions and forum data, and human reviewers verified the accuracy of the answers. In addition, given how varied the answers to hard Olympiad problems can be, an evaluation method based on GPT-4o and a dedicated evaluation model is provided, so evaluation can be launched with one click.
Clear and reasonable difficulty grading: the evaluation set is challenging overall and spans a wide difficulty range, from Olympiad-preparatory (T4) competitions such as CEMC up to top-tier Olympiads (T0) such as the IMO, IMC and Putnam. These competitions demand not only a solid mathematical foundation but also strong logical reasoning and creativity; only a very small number of exceptionally gifted people achieve excellent results in them.
Very wide coverage of problem types: the problems span more than 33 subfields of mathematics. Based on the structure of the field, the team built a tree-like domain taxonomy; each problem is tagged with one or more domains, i.e. one or more tree paths, which enables a detailed analysis of model performance across mathematical disciplines and difficulty levels (a sketch of such an analysis follows this list).
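As a concrete illustration of the per-difficulty, per-domain analysis mentioned above, here is a minimal sketch that aggregates accuracy from per-problem results. The record fields ("difficulty", "domain", "correct") and the " -> " path separator are assumptions for illustration, not the official schema.

```python
# Hedged sketch: aggregate accuracy by difficulty level and by top-level domain.
# The record fields ("difficulty", "domain", "correct") are illustrative placeholders.
from collections import defaultdict

def accuracy_breakdown(results):
    """results: iterable of dicts such as
    {"difficulty": 7.0,
     "domain": ["Mathematics -> Algebra -> Polynomials"],
     "correct": True}
    """
    by_difficulty = defaultdict(lambda: [0, 0])  # level -> [correct, total]
    by_domain = defaultdict(lambda: [0, 0])      # top-level domain -> [correct, total]

    for r in results:
        hit = int(bool(r["correct"]))
        by_difficulty[r["difficulty"]][0] += hit
        by_difficulty[r["difficulty"]][1] += 1
        for path in r["domain"]:                 # a problem may carry several tree paths
            parts = [p.strip() for p in path.split("->")]
            top = parts[1] if len(parts) > 1 else parts[0]
            by_domain[top][0] += hit
            by_domain[top][1] += 1

    def as_pct(table):
        return {k: f"{c}/{n} = {c / n:.1%}" for k, (c, n) in table.items()}

    return as_pct(by_difficulty), as_pct(by_domain)
```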
Construction of the Omni-MATH test set
Data Structure
The research team first made a detailed survey of the main Olympiad mathematics competitions in China and abroad, which showed that a student goes through round after round of selection on the way from competition preparation to the top contests.
For example, in the British system the selection runs level by level through JMC → IMC → SMC → BMO 1 → BMO 2 → IMO (this IMC, the Intermediate Mathematical Challenge, is not the same competition as the IMC mentioned above, the International Mathematics Competition for university students).
In the American system, the selection is likewise staged: AMC 8 → AMC 10 → AMC 12 → AIME → USA(J)MO → IMO.
This inspired the team to assign difficulty levels for model evaluation. The research team therefore surveyed competitions of different difficulty levels around the world, so that Omni-MATH covers a diverse range of difficulty even within Olympiad-level math.
In addition, Olympiad-level math actually touches many mathematical fields. The research team wondered whether data from different fields would interact during model training, for example whether training data from field A can help the model generalize and improve in field B; data engineering in this direction is very meaningful.
To lay a foundation for research in this direction, the researchers consulted relevant competition textbooks and divided the data in this evaluation set into very fine-grained fields, starting from major categories such as number theory, algebra and geometry, down to the specific subfields and knowledge points within each.
There are two main sources of evaluation data: the problems and solutions of the various competitions, and the well-known mathematics website Art of Problem Solving (AoPS). For each target competition, answers are taken from the official solutions first.
If a competition has no public solutions, the team crawls replies from the AoPS forum. Since those replies are written by real users, some of them will inevitably be wrong, so strict screening is required.
From the AoPS site the research team kept questions that had more than three candidate replies with well-formed answers, and retained a question only if at least three of those answers agreed; manual screening was then used to further ensure accuracy.
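A minimal sketch of that consensus filter, under the assumption that an answer string has already been extracted from each reply (the extraction step itself is not shown), might look like this:

```python
# Hedged sketch of the AoPS reply-consensus filter described in the text.
from collections import Counter

def consensus_answer(replies, min_replies=4, min_agreement=3):
    """replies: list of candidate answer strings extracted from forum replies.
    Returns the agreed answer, or None if the question should be discarded."""
    if len(replies) < min_replies:            # "more than 3 candidates"
        return None
    normalized = [r.strip().lower() for r in replies]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count >= min_agreement else None   # "3 consistent answers"

print(consensus_answer(["42", "42", " 42 ", "41"]))   # -> "42"
print(consensus_answer(["1", "2", "3", "4"]))         # -> None
```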
Data processing
Processing of the data itself:
After crawling the PDFs of the official solutions, the developers used Mathpix to convert them into LaTeX as the reference solutions. After crawling forum answers, they first used GPT-4o to reformat them into regular replies, and then manually checked that each was consistent with the answer to the original problem.
For both data sources, team members finally checked by hand that the information was consistent with the original source.
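The forum-reply reformatting step could be wired up roughly as in the sketch below. The prompt wording is our own illustration rather than the team's actual prompt, and the manual consistency check described above still has to follow.

```python
# Illustrative sketch of reformatting a raw AoPS reply with GPT-4o.
# The prompt below is an assumption for illustration, not the team's actual prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def reformat_reply(problem: str, raw_reply: str) -> str:
    prompt = (
        "Rewrite the following forum reply as a clean, self-contained solution.\n"
        "End with a line of the form 'Final answer: <answer>'.\n\n"
        f"Problem:\n{problem}\n\nForum reply:\n{raw_reply}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```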
Difficulty classification:
The difficulty grading follows the question difficulty ratings on the AoPS website.
Specifically, problems from different competitions differ fundamentally in difficulty: CEMC problems and IMO problems, for example, are worlds apart. But the problems within a single competition also vary; one IMO paper contains both easier and harder questions. The difficulty grading of the evaluation set therefore strictly follows the per-problem difficulty ratings that the AoPS website gives for each competition (from 1 to 10, mostly integers, with a few ratings in .25 or .5 increments).
For problems not covered on the website, the team organized the site's difficulty guide into few-shot prompts and used GPT-4o to label their difficulty.
[Figure: overall difficulty distribution and the distribution across competitions]
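A rough idea of how such a few-shot difficulty prompt might be assembled is sketched below; the example problems and their ratings are placeholders, not the actual prompt from the Omni-MATH repository. The resulting string would then be sent to GPT-4o as in the earlier reformatting sketch.

```python
# Hedged sketch: building a few-shot difficulty-labelling prompt in the spirit
# of the AoPS 1-10 scale. The examples here are placeholders, not the real ones.

FEW_SHOT_EXAMPLES = [
    ("Compute the number of positive divisors of 2024.", 1.5),
    ("Prove that there are infinitely many primes p with p ≡ 3 (mod 4).", 4.0),
    ("An IMO-style functional equation over the reals ...", 8.0),
]

def build_difficulty_prompt(problem: str) -> str:
    lines = [
        "Rate the difficulty of the final problem on the AoPS scale from 1 to 10.",
        "Reply with a single number.",
        "",
    ]
    for example, score in FEW_SHOT_EXAMPLES:
        lines.append(f"Problem: {example}\nDifficulty: {score}\n")
    lines.append(f"Problem: {problem}\nDifficulty:")
    return "\n".join(lines)

print(build_difficulty_prompt("Find all integers n such that n^2 + 1 divides n^5 + 1."))
```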
Field classification:
Unlike traditional math benchmarks, Olympiad-level problems cover far more fields and a much broader range of knowledge.
To better organize and unify these Olympiad problems, and to support later exploration of how data from different mathematical fields relate to each other, the team built a more comprehensive tree-like classification system. Drawing on relevant competition textbooks, the researchers divided the Olympiad-related fields into geometry, algebra, number theory, applied mathematics and so on, and then kept subdividing each field into subfields and fine-grained knowledge points.
This tree-like classification makes it easier to understand the relationships between problems and the model's performance in different fields. Using the tree as a template, combined with examples from the competition textbooks, the team built few-shot prompts (the specific tree structure and prompt content can be found in the code repository linked at the end of the article).
The team then used GPT-4o to classify each question into one or more categories.
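To make the idea concrete, here is a toy fragment of such a taxonomy written as a nested dictionary, together with a helper that flattens it into root-to-leaf paths of the kind each problem is tagged with. The subfield names and the " -> " separator are illustrative; the full tree lives in the official repository.

```python
# Toy fragment of a tree-like domain taxonomy; the real tree is in the Omni-MATH repo.
TAXONOMY = {
    "Mathematics": {
        "Algebra": {"Polynomials": {}, "Inequalities": {}},
        "Number Theory": {"Congruences": {}, "Diophantine Equations": {}},
        "Geometry": {"Plane Geometry": {}, "Solid Geometry": {}},
        "Applied Mathematics": {"Combinatorics": {}, "Probability": {}},
    }
}

def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path as a ' -> '-joined string."""
    for name, subtree in tree.items():
        path = prefix + (name,)
        if subtree:
            yield from leaf_paths(subtree, path)
        else:
            yield " -> ".join(path)

for p in leaf_paths(TAXONOMY):
    print(p)
# A problem may carry one or more such paths, e.g.
# ["Mathematics -> Algebra -> Polynomials", "Mathematics -> Number Theory -> Congruences"]
```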
Open-source answer verifier
Omni-Judge is a verifier obtained by fine-tuning Llama3-Instruct; it checks whether a model's answer agrees with the reference answer. Because Olympiad-level problems admit many kinds of answers, rule-based evaluation is genuinely hard: after obtaining a model's prediction, one still has to decide whether its output matches the standard answer. In addition to evaluation with GPT-4o, a simpler option is therefore provided: the team fine-tuned Llama3-Instruct on the CoT data generated while GPT-4o served as the evaluation model, yielding an open-source verifier whose judgments agree with GPT-4o's as much as 95% of the time.
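The Omni-Judge checkpoint is available on the Hugging Face Hub (link below). A hedged usage sketch with transformers follows; the exact prompt template Omni-Judge expects is defined on its model card, so the plain chat formatting used here is an assumption and should be replaced by the official template in practice.

```python
# Hedged sketch of calling Omni-Judge with transformers.
# The prompt format below is an assumption; follow the model card's template in practice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KbsdJames/Omni-Judge"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(question: str, reference_answer: str, model_output: str) -> str:
    messages = [{
        "role": "user",
        "content": (
            f"Question:\n{question}\n\n"
            f"Reference answer:\n{reference_answer}\n\n"
            f"Candidate solution:\n{model_output}\n\n"
            "Is the candidate's final answer equivalent to the reference answer?"
        ),
    }]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```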
Reference Links:
Project Page:
https://omni-math.github.io/
Github:
https://github.com/KbsdJames/Omni-MATH/
Dataset:
https://huggingface.co/datasets/KbsdJames/Omni-MATH/
Omni-Judge:
https://huggingface.co/KbsdJames/Omni-Judge/