EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery

Jicheng Ma; Guohua Wang; Xinhua Feng; Yiming Liu; Zhichao Hu; Yuhong Liu

arXiv:2601.01400 · cs.CL · May 8, 2026

TL;DR

EternalMath is a dynamic, automated benchmark derived from recent mathematical literature, designed to evaluate and track the evolving capabilities of large language models in frontier mathematics.

Contribution

The paper introduces a fully automated, theorem-grounded pipeline for creating an evolving, verifiable mathematical reasoning benchmark directly from research papers.

Findings

01. State-of-the-art LLMs show significant gaps in frontier mathematical reasoning.

02. EternalMath enables scalable, reproducible, and continuously updatable evaluation.

03. The approach supports domain-specific customization and temporal extensibility.

Abstract

Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility,…
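The three pipeline stages named in the abstract (pick a quantitative result, instantiate a parameterized template, confirm the answer by execution) can be illustrated with a small sketch. It is not the authors' implementation: Cayley's formula stands in for a theorem taken from a recent paper, and the names `Problem`, `cayley_count`, `kirchhoff_count`, and `instantiate` are hypothetical. The answer is computed from the closed form and then checked against an independent computation before the problem is emitted.

```python
from dataclasses import dataclass
from fractions import Fraction
import random


@dataclass(frozen=True)
class Problem:
    """One instantiated task: a question string and its deterministic answer."""
    question: str
    answer: int


def cayley_count(n: int) -> int:
    """Closed form (Cayley's formula): K_n has n**(n-2) labeled spanning trees."""
    return 1 if n == 1 else n ** (n - 2)


def kirchhoff_count(n: int) -> int:
    """Independent check via the matrix-tree theorem.

    Computes the determinant of the Laplacian of K_n with one row and
    column removed, using exact rational Gaussian elimination.
    """
    size = n - 1
    m = [[Fraction(n - 1) if i == j else Fraction(-1) for j in range(size)]
         for i in range(size)]
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)


def instantiate(seed: int) -> Problem:
    """Sample a parameter, solve, and verify by execution before emitting."""
    rng = random.Random(seed)  # seeded, so the same seed gives the same task
    n = rng.randint(3, 8)
    answer = cayley_count(n)
    if kirchhoff_count(n) != answer:
        raise ValueError(f"verification failed for n={n}")
    question = f"How many labeled spanning trees does the complete graph K_{n} have?"
    return Problem(question=question, answer=answer)


if __name__ == "__main__":
    print(instantiate(seed=0))
```

Seeding the sampler keeps each instance reproducible, and the second, independent computation is what makes the stored answer verifiable rather than merely asserted.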
