Towards Robust Mathematical Reasoning

Thang Luong; Dawsen Hwang; Hoang H. Nguyen; Golnaz Ghiasi; Yuri Chervonyi; Insuk Seo; Junsu Kim; Garrett Bingham; Jonathan Lee; Swaroop Mishra; Alex Zhai; Clara Huiyi Hu; Henryk Michalewski; Jimin Kim; Jeonghyun Ahn; Junhwi Bae; Xingyou Song; Trieu H. Trinh; Quoc V. Le; Junehyuk Jung

arXiv:2511.01846 · cs.CL · November 4, 2025

TL;DR

This paper introduces IMO-Bench, a suite of challenging mathematical reasoning benchmarks modeled on the International Mathematical Olympiad (IMO), for evaluating and improving foundation models' problem-solving and proof-writing skills. The reported results reach 80.0% on IMO-AnswerBench and 65.7% on IMO-ProofBench.

Contribution

The paper presents IMO-Bench, a new suite of advanced mathematical reasoning benchmarks for foundation models, including both problem-solving and proof-writing evaluations, with automatic grading tools.

Findings

01. Achieved 80.0% on IMO-AnswerBench
02. Achieved 65.7% on IMO-ProofBench
03. Surpassed previous models by large margins

Abstract

Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists, that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-ProofBench is the next-level evaluation for proof-writing capabilities; it includes both basic and advanced IMO-level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of the…
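
To make "verifiable short answers" concrete, here is a minimal sketch of how accuracy on such a benchmark could be scored. It assumes exact matching after light normalization, which is an illustrative simplification and not the grading procedure described in the paper; the function names `normalize` and `accuracy` are hypothetical.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Canonicalize a short answer (hypothetical helper, not from the paper).

    Trims whitespace, lowercases, removes inner spaces, and reduces plain
    numeric answers such as "2/4" or "0.5" to a canonical fraction string.
    """
    text = answer.strip().lower().replace(" ", "")
    try:
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        # Not a plain number (e.g. an algebraic expression): compare as text.
        return text


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference after normalization."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty answer lists")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)
```

Under this scoring, 80.0% on a 400-problem set corresponds to 320 matching answers. Exact matching cannot recognize algebraically equivalent expressions written differently, so a real grader would need something stronger.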

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Mathematics Education and Teaching Techniques · Machine Learning in Materials Science