Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

Zicheng Lyu; Wenjie Yang; Shengzhong Zhang; Zengfeng Huang

arXiv:2605.09012·cs.AI·May 12, 2026

Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

Zicheng Lyu, Wenjie Yang, Shengzhong Zhang, Zengfeng Huang

PDF

TL;DR

Re$^2$Math is a benchmark designed to evaluate the ability of language models to retrieve and verify scholarly sources for mathematical proofs, highlighting current system limitations.

Contribution

The paper introduces Re$^2$Math, a novel benchmark for tool-grounded retrieval in mathematical reasoning, enabling systematic evaluation of literature-grounded proof assistance.

Findings

01

Best fixed-judge ToolAcc is 7.0% on the current benchmark.

02

Current systems often retrieve valid statements but fail to verify their applicability.

03

Re$^2$Math supports automatic, continual expansion and reproducibility.

Abstract

Large language models are increasingly capable at closed-world mathematical reasoning, but research assistance also requires source-grounded use of the literature. When a proof reaches a non-trivial step, a useful assistant should determine whether the needed tool (e.g., a lemma) already exists, identify a suitable scholarly source, and verify that its assumptions align with the current proof context. To rigorously evaluate such capabilities, we introduce Re $^{2}$ Math, a benchmark for tool-grounded retrieval from partial mathematical proofs. Each instance is built from a candidate instrumental citation in the proof of a main theorem, with hierarchical context and an optional leakage-controlled anchor hint. We also make the task source-grounded yet citation-agnostic in that any admissible theorem sufficient for the proof transition is accepted. Evaluation uses a release-frozen retrieval…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.