EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery
Jicheng Ma, Guohua Wang, Xinhua Feng, Yiming Liu, Zhichao Hu, Yuhong Liu

TL;DR
EternalMath is a dynamic, automated benchmark derived from recent mathematical literature, designed to evaluate and track the evolving capabilities of large language models in frontier mathematics.
Contribution
The paper introduces a fully automated, theorem-grounded pipeline for creating an evolving, verifiable mathematical reasoning benchmark directly from research papers.
Findings
State-of-the-art LLMs show significant gaps in frontier mathematical reasoning.
EternalMath enables scalable, reproducible, and continuously updatable evaluation.
The approach supports domain-specific customization and temporal extensibility.
Abstract
Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility,…
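The three stages named in the abstract (a quantitative result, a parameterized template, a deterministic solution checked by execution) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the textbook sum-of-cubes identity stands in for a theorem extracted from a research paper, and the names `ProblemInstance`, `closed_form`, `brute_force`, `instantiate`, and `grade` are hypothetical, not part of the EternalMath pipeline.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemInstance:
    """One concrete task produced from a template (hypothetical structure)."""
    question: str
    answer: int


def closed_form(n: int) -> int:
    # Stand-in "theorem": 1^3 + 2^3 + ... + n^3 = (n(n+1)/2)^2.
    # A real pipeline would take this result from a paper, not hard-code it.
    return (n * (n + 1) // 2) ** 2


def brute_force(n: int) -> int:
    # Independent computation used only to check the closed form by execution.
    return sum(k ** 3 for k in range(1, n + 1))


def instantiate(seed: int) -> ProblemInstance:
    # A seeded generator makes each instance reproducible.
    rng = random.Random(seed)
    n = rng.randint(50, 500)
    answer = closed_form(n)
    # Execution-based verification: reject the instance if the two
    # computations disagree.
    if answer != brute_force(n):
        raise ValueError(f"verification failed for n={n}")
    question = f"Compute the sum of k^3 for k = 1, ..., {n}."
    return ProblemInstance(question=question, answer=answer)


def grade(instance: ProblemInstance, model_output: str) -> bool:
    # Exact-match grading against the verified answer.
    try:
        return int(model_output.strip()) == instance.answer
    except ValueError:
        return False
```

Because instances are generated from a seed, the same seed reproduces the same task, while a new seed yields a fresh task from the same template. This is one simple way to obtain the reproducibility and updatability the abstract describes, under the assumptions above.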
