Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Bofei Gao; Feifan Song; Zhe Yang; Zefan Cai; Yibo Miao; Qingxiu Dong; Lei Li; Chenghao Ma; Liang Chen; Runxin Xu; Zhengyang Tang; Benyou Wang; Daoguang Zan; Shanghaoran Quan; Ge Zhang; Lei Sha; Yichang Zhang; Xuancheng Ren; Tianyu Liu; Baobao Chang

arXiv:2410.07985 · cs.CL · December 25, 2024 · 3 cites

TL;DR

This paper introduces Omni-MATH, a comprehensive benchmark with 4428 Olympiad-level math problems designed to evaluate large language models' reasoning capabilities, revealing current models' struggles with high-level mathematical problems.

Contribution

The paper presents a new, challenging Olympiad-level math benchmark with extensive categorization and difficulty levels, filling gaps left by existing datasets.

Findings

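01. Even the most advanced models struggle with Olympiad-level problems.

02. The best-performing models reach only around 50-60% accuracy, leaving substantial room for improvement.

03. With problems labeled across more than 33 sub-domains and more than 10 difficulty levels, the benchmark enables detailed assessment of mathematical reasoning in LLMs.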

Abstract

Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-mathematical reasoning.…
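
The sub-domain and difficulty labels described in the abstract make it possible to break a single accuracy number down by group. The following is a minimal sketch of such a breakdown, not the authors' evaluation code: the record fields (`domain`, `difficulty`, `correct`), the helper name `accuracy_by_group`, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for records grouped by the value of rec[key].

    Each record is assumed (hypothetically) to be a dict with a boolean
    "correct" field plus the grouping field named by `key`.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        correct[group] += int(rec["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical graded results; these are not real benchmark data.
results = [
    {"domain": "Algebra", "difficulty": 3, "correct": True},
    {"domain": "Algebra", "difficulty": 7, "correct": False},
    {"domain": "Geometry", "difficulty": 7, "correct": True},
    {"domain": "Number Theory", "difficulty": 9, "correct": False},
]

by_domain = accuracy_by_group(results, "domain")
by_difficulty = accuracy_by_group(results, "difficulty")
```

Reporting accuracy per group rather than as a single aggregate shows whether a model's errors concentrate in particular sub-domains or at the highest difficulty levels.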

Taxonomy

Topics: Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling