LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Antoine Peyronnet; Fabian Gloeckle; Amaury Hayat

arXiv:2602.24173·cs.AI·March 2, 2026

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

PDF

Open Access

TL;DR

LemmasBench introduces a dynamic, research-level benchmark for evaluating large language models on mathematics, extracting lemmas from arXiv to assess models' ability to prove research-level theorems.

Contribution

The paper presents a novel, updatable benchmark pipeline that extracts and rewrites research lemmas from arXiv, enabling ongoing evaluation of LLMs on cutting-edge mathematical research.

Findings

01

Current LLMs achieve 10-15% accuracy in theorem proving.

02

The benchmark can be regularly updated with new research problems.

03

There is significant room for improvement in LLMs' mathematical reasoning.

Abstract

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15 $%$ accuracy in theorem proving (pass@1) depending on the…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMathematics, Computing, and Information Processing · Machine Learning in Materials Science · Topic Modeling