RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Jie Zhang; Cezara Petrui; Kristina Nikolić; Florian Tramèr

arXiv:2505.12575·cs.AI·October 21, 2025

TL;DR

RealMath is a new benchmark derived from authentic research papers and forums that evaluates language models on real-world mathematical research tasks, revealing their surprising capabilities in handling research-level mathematics.

Contribution

We introduce RealMath, a novel, research-paper-based benchmark for assessing LLMs on authentic mathematical research tasks, addressing sourcing, evaluation, and dataset refresh challenges.

Findings

01. LLMs handle research-level mathematics surprisingly well relative to competition problems.

02. Current models may already serve as useful assistants for mathematicians.

03. RealMath is publicly available for further research.

Abstract

Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions -- failing to capture the nature of mathematics encountered in actual research environments. We introduce RealMath, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks. Our approach addresses three critical challenges: sourcing diverse research-level content, enabling reliable automated evaluation through verifiable statements, and designing a continually refreshable dataset to mitigate contamination risks. Experimental results across multiple LLMs reveal surprising capabilities in handling research mathematics compared to competition problems, suggesting current models may already serve as valuable assistants for…
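
To illustrate what automated evaluation against verifiable statements could look like in practice, here is a minimal sketch. It is not the RealMath evaluation code: the function names `normalize` and `accuracy`, the exact-match criterion, and the normalization rules are all assumptions made for illustration.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Canonicalize an answer string (hypothetical rules, for illustration only)."""
    text = answer.strip().lower().replace(" ", "")
    try:
        # Fraction parses integers, decimals, and "a/b" forms,
        # so equivalent numeric answers map to the same string.
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers (e.g. symbolic expressions) are compared as text.
        return text


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose normalized form equals the reference's."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Exact matching after normalization is the simplest possible criterion. Answers that are mathematically equivalent but written differently (for example `n+1` versus `1+n`) would need a symbolic or model-based comparison, which this sketch deliberately leaves out.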

Code & Models

Repositories

ethz-spylab/realmath (Official)

Datasets

facebook/principia-bench
dataset · 180 downloads

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Topic Modeling