AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

Shengnan An; Xunliang Cai; Xuezhi Cao; Xiaoyu Li; Yehao Lin; Junlin Liu; Xinxuan Lv; Dan Ma; Xuanlin Wang; Ziwen Wang; Shuang Zhou (Alphabetical order by last name)

arXiv:2510.26768·cs.CL·October 31, 2025

TL;DR

AMO-Bench is a challenging new benchmark with Olympiad-level math problems designed to evaluate and push the limits of large language models' mathematical reasoning abilities, revealing significant performance gaps.

Contribution

This paper introduces AMO-Bench, a novel, expert-validated, high-difficulty math benchmark with original problems, specifically designed to assess top-tier LLMs beyond existing competition-based tests.

Findings

01

Even the best-performing LLM reaches only 52.4% accuracy on AMO-Bench.

02

Most LLMs perform below 40%, indicating substantial room for improvement.

03

Performance improves with increased test-time compute, showing a promising scaling trend.

Abstract

We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for…
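Because each problem asks only for a final answer, grading can in principle reduce to comparing a model's final answer with a reference answer. The sketch below illustrates that idea under stated assumptions: the helpers `normalize` and `accuracy` are hypothetical, and the normalization rules (trimming whitespace and `$` delimiters, canonicalizing plain numbers and simple fractions) are illustrative choices, not the grader used by the AMO-Bench authors.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Return a canonical string for a final answer (hypothetical rules)."""
    # Drop surrounding whitespace, math-mode dollar signs, and a trailing period.
    text = answer.strip().strip("$").strip()
    if text.endswith("."):
        text = text[:-1]
    compact = text.replace(" ", "")
    try:
        # Plain numbers and simple fractions ("0.5", "3/6") map to one form.
        return str(Fraction(compact))
    except (ValueError, ZeroDivisionError):
        # Anything else is compared as a case-insensitive string.
        return compact.lower()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose normalized form matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    preds = ["1/2", "7", "x+1"]
    refs = ["0.5", "8", "X + 1"]
    print(f"accuracy = {accuracy(preds, refs):.3f}")
```

Exact matching after normalization is the simplest possible choice. Answers written as symbolic expressions or sets would need a more careful equivalence check than this sketch provides.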

Code & Models

Datasets

meituan-longcat/AMO-Bench
dataset · 1.6k downloads
