Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection

Sadegh Mahdavi; Branislav Kisacanin; Shubham Toshniwal; Wei Du; Ivan Moshkov; George Armstrong; Renjie Liao; Christos Thrampoulidis; Igor Gitman

arXiv:2511.13027·cs.AI·November 18, 2025

Open Access

TL;DR

This paper investigates how to scale generative verification of natural-language mathematical proofs with large language models, and shows that evaluation setup, prompt design, and reinforcement learning each affect proof verification and solution selection.

Contribution

The paper evaluates verifiers on both proof-based and final-answer reasoning, scales GenSelect and LLM-as-a-Judge to millions of tokens, and analyzes the impact of prompt design and reinforcement learning on verifier performance.

Findings

01

Combining GenSelect and LLM-as-a-Judge is the most effective framework identified for solution verification and selection.

02

Prompt choice significantly influences LLM-as-a-Judge performance.

03

Reinforcement learning reduces prompt sensitivity but does not improve final-answer accuracy.

Abstract

Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly…
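One way to picture the combination of LLM-as-a-Judge and GenSelect mentioned in the abstract is as a two-stage pipeline: a judge first filters candidate proofs by vote, then a selector picks one survivor through comparisons. The sketch below is only an illustration under stated assumptions, not the paper's pipeline. The names `judge_pass_rate` and `select_proof`, the vote count, the threshold, and the pairwise knockout bracket are all hypothetical, and the toy `judge` and `selector` stand in for real model calls.

```python
from typing import Callable, List

# Hypothetical interfaces; a real system would wrap LLM calls here.
Judge = Callable[[str, str], bool]         # (problem, proof) -> judged valid?
Selector = Callable[[str, str, str], int]  # (problem, proof_a, proof_b) -> 0 or 1


def judge_pass_rate(problem: str, proof: str, judge: Judge, n_votes: int) -> float:
    """Fraction of independent judge calls that accept the proof."""
    votes = [judge(problem, proof) for _ in range(n_votes)]
    return sum(votes) / n_votes


def select_proof(
    problem: str,
    candidates: List[str],
    judge: Judge,
    selector: Selector,
    n_votes: int = 4,         # assumed value, not taken from the paper
    threshold: float = 0.5,   # assumed value, not taken from the paper
) -> str:
    """Filter candidates with the judge, then pick one via pairwise knockout."""
    if not candidates:
        raise ValueError("need at least one candidate proof")

    scored = [(judge_pass_rate(problem, c, judge, n_votes), c) for c in candidates]
    survivors = [c for rate, c in scored if rate >= threshold]
    if not survivors:
        # Nothing passed the filter: fall back to the top-scoring candidates.
        best = max(rate for rate, _ in scored)
        survivors = [c for rate, c in scored if rate == best]

    # Knockout bracket: compare neighbours, carry an unpaired proof forward.
    pool = list(survivors)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            winner = selector(problem, pool[i], pool[i + 1])
            next_round.append(pool[i + winner])
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Toy stand-ins for model calls, used only to make the sketch runnable.
def toy_judge(problem: str, proof: str) -> bool:
    return proof.endswith("QED")


def toy_selector(problem: str, proof_a: str, proof_b: str) -> int:
    return 0 if len(proof_a) >= len(proof_b) else 1
```

With the toy stand-ins, `select_proof("p", ["short QED", "a longer proof QED", "no ending"], toy_judge, toy_selector)` drops the third candidate at the judge stage and then compares the remaining two. Filtering before selection keeps the number of comparisons small, but whether filtering or selection should come first is a design choice this sketch does not settle.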

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Machine Learning in Materials Science · Topic Modeling