HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Erik Y. Wang; Sumeet Motwani; James V. Roggeveen; Eliot Hodges; Dulhan Jayalath; Charles London; Kalyan Ramakrishnan; Flaviu Cipcigan; Philip Torr; Alessandro Abate

arXiv:2603.15617·cs.LG·March 17, 2026

Open Access

TL;DR

HorizonMath is a new benchmark of over 100 predominantly unsolved mathematical problems designed to evaluate whether AI can make novel discoveries. It uses automated verification to flag candidate breakthroughs from large language models such as GPT 5.4 Pro.

Contribution

The paper introduces a scalable, open-source platform for assessing AI-driven mathematical discovery on unsolved problems. Because the solutions are unknown, the benchmark avoids data contamination, and automated verification removes the need for manual review.

Findings

01

GPT 5.4 Pro proposed solutions to two problems, improving on known results.

02

HorizonMath enables scalable, automated evaluation of AI's mathematical reasoning.

03

AI-proposed solutions have the potential to contribute novel results to mathematics.

Abstract

Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are expensive to scale.…
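The "hard to discover, cheap to verify" property described in the abstract can be illustrated with a minimal sketch. This is not the HorizonMath evaluation framework: the function names `reference_value` and `verify_candidate` are hypothetical, and the target (the series sum of 1/n!, whose closed form e is already known) is a solved stand-in chosen only so the example is checkable. The idea shown is that a proposed closed form can be accepted or rejected by comparing it against an independent high-precision numerical evaluation.

```python
from decimal import Decimal, localcontext


def reference_value(digits: int) -> Decimal:
    """Numerically sum 1/n! to roughly `digits` decimal places.

    Stand-in for an independent high-precision evaluation of the target
    quantity; the real benchmark's verifiers are not reproduced here.
    """
    with localcontext() as ctx:
        ctx.prec = digits + 10          # guard digits against rounding error
        threshold = Decimal(10) ** -(digits + 5)
        total = Decimal(0)
        term = Decimal(1)               # 1/0!
        n = 0
        while term > threshold:
            total += term
            n += 1
            term /= n                   # 1/n! from 1/(n-1)!
        return total


def verify_candidate(candidate: Decimal, digits: int = 30) -> bool:
    """Accept the candidate only if it matches the reference to `digits` places."""
    ref = reference_value(digits)
    with localcontext() as ctx:
        ctx.prec = digits + 10
        return abs(candidate - ref) < Decimal(10) ** -digits


# A correct candidate closed form (e = exp(1)) and an incorrect one.
with localcontext() as ctx:
    ctx.prec = 50
    good = Decimal(1).exp()
bad = Decimal("2.718")

print(verify_candidate(good), verify_candidate(bad))
```

Numerical agreement to a fixed number of digits is evidence rather than proof, so a check like this can only flag a candidate for closer inspection.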

Code & Models

Datasets

squashenthus/HorizonMath (dataset, 115 downloads)

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Model Reduction and Neural Networks