Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
Zicheng Lyu, Wenjie Yang, Shengzhong Zhang, Zengfeng Huang

TL;DR
Re$^2$Math is a benchmark designed to evaluate the ability of language models to retrieve and verify scholarly sources for mathematical proofs, highlighting current system limitations.
Contribution
The paper introduces Re$^2$Math, a novel benchmark for tool-grounded retrieval in mathematical reasoning, enabling systematic evaluation of literature-grounded proof assistance.
Findings
Best fixed-judge ToolAcc is 7.0% on the current benchmark.
Current systems often retrieve valid statements but fail to verify their applicability.
Re$^2$Math supports automatic, continual expansion and reproducibility.
Abstract
Large language models are increasingly capable at closed-world mathematical reasoning, but research assistance also requires source-grounded use of the literature. When a proof reaches a non-trivial step, a useful assistant should determine whether the needed tool (e.g., a lemma) already exists, identify a suitable scholarly source, and verify that its assumptions align with the current proof context. To rigorously evaluate such capabilities, we introduce ReMath, a benchmark for tool-grounded retrieval from partial mathematical proofs. Each instance is built from a candidate instrumental citation in the proof of a main theorem, with hierarchical context and an optional leakage-controlled anchor hint. We also make the task source-grounded yet citation-agnostic in that any admissible theorem sufficient for the proof transition is accepted. Evaluation uses a release-frozen retrieval…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
