LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

TL;DR
LemmasBench introduces a dynamic, research-level benchmark for evaluating large language models on mathematics, extracting lemmas from arXiv to assess models' ability to prove research-level theorems.
Contribution
The paper presents a novel, updatable benchmark pipeline that extracts and rewrites research lemmas from arXiv, enabling ongoing evaluation of LLMs on cutting-edge mathematical research.
Findings
Current LLMs achieve 10-15% accuracy in theorem proving.
The benchmark can be regularly updated with new research problems.
There is significant room for improvement in LLMs' mathematical reasoning.
Abstract
We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15 accuracy in theorem proving (pass@1) depending on the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMathematics, Computing, and Information Processing · Machine Learning in Materials Science · Topic Modeling
