RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics
Jie Zhang, Cezara Petrui, Kristina Nikolić, Florian Tramèr

TL;DR
RealMath is a new benchmark derived from authentic research papers and forums that evaluates language models on real-world mathematical research tasks, revealing their surprising capabilities in handling research-level mathematics.
Contribution
We introduce RealMath, a novel, research-paper-based benchmark for assessing LLMs on authentic mathematical research tasks, addressing sourcing, evaluation, and dataset refresh challenges.
Findings
LLMs show surprising capabilities on research-level mathematics compared to competition problems.
Current models may serve as useful assistants for mathematicians.
RealMath is publicly available for further research.
Abstract
Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions -- failing to capture the nature of mathematics encountered in actual research environments. We introduce RealMath, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks. Our approach addresses three critical challenges: sourcing diverse research-level content, enabling reliable automated evaluation through verifiable statements, and designing a continually refreshable dataset to mitigate contamination risks. Experimental results across multiple LLMs reveal surprising capabilities in handling research mathematics compared to competition problems, suggesting current models may already serve as valuable assistants for…
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Topic Modeling
