ROC-n-reroll: How verifier imperfection affects test-time scaling

Florian E. Dorner; Yatong Chen; André F. Cruz; Fanny Yang

arXiv:2507.12399·cs.LG·October 13, 2025

Open Access · 3 Reviews

TL;DR

This paper provides a theoretical analysis of how verifier imperfection impacts test-time scaling methods like Best-of-N and Rejection Sampling, revealing the role of the verifier's ROC curve in determining accuracy and performance limits.

Contribution

It introduces a formal characterization of test-time scaling performance based on the verifier's ROC curve, bridging the gap between empirical results and theoretical understanding.

Findings

01

Rejection Sampling outperforms Best-of-N at fixed compute.

02

Both methods converge to the same accuracy in the infinite-compute limit.

03

High-compute performance cannot be reliably predicted from low-compute observations.
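
The role of the ROC curve in these findings can be illustrated with a short derivation. This is a sketch under assumptions not stated on this page: RS is taken to mean resampling until the verifier score clears a threshold, $\pi$ denotes the probability that a single sample is correct, and $(\mathrm{FPR}, \mathrm{TPR})$ is the verifier's operating point at that threshold. The symbols are illustrative and may not match the paper's notation.

```latex
% Bayes' rule for an accepted sample, and the mean of the geometric stopping time
\begin{align}
\mathrm{acc}_{\mathrm{RS}}
  &= \frac{\pi\,\mathrm{TPR}}{\pi\,\mathrm{TPR} + (1-\pi)\,\mathrm{FPR}},
&
\mathbb{E}[\text{samples}]
  &= \frac{1}{\pi\,\mathrm{TPR} + (1-\pi)\,\mathrm{FPR}}.
\end{align}

% Raising the threshold pushes the operating point toward (0, 0).
% If the ratio TPR/FPR tends to a finite limit s there:
\begin{equation}
\lim_{\mathrm{FPR}\to 0} \mathrm{acc}_{\mathrm{RS}}
  = \frac{\pi s}{\pi s + (1-\pi)},
\qquad
s = \lim_{\mathrm{FPR}\to 0} \frac{\mathrm{TPR}}{\mathrm{FPR}}.
\end{equation}
```

Under these assumptions the high-compute limit depends only on the slope of the ROC curve near the origin, which operating points observed at low compute need not reveal. That is consistent with finding 03.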

Abstract

Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance -- a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on…
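
The comparison between RS and BoN at matched compute can be explored numerically with a toy verifier. The sketch below is not the authors' code or experimental setup. It assumes a hypothetical verifier whose scores are Gaussian with unit variance (mean `mu` for correct responses, mean 0 for incorrect ones), takes RS to mean resampling until the score reaches a threshold `tau`, and uses `pi` for the per-sample probability of a correct response. All function and parameter names are illustrative.

```python
import math


def norm_sf(x, mean=0.0):
    """P(score >= x) for a unit-variance Gaussian score (assumed toy model)."""
    return 0.5 * math.erfc((x - mean) / math.sqrt(2.0))


def norm_cdf(x, mean=0.0):
    return 0.5 * math.erfc(-(x - mean) / math.sqrt(2.0))


def norm_pdf(x, mean=0.0):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def rs_accuracy_and_compute(pi, mu, tau):
    """Accuracy and expected number of samples for RS with threshold tau."""
    tpr = norm_sf(tau, mu)    # correct responses that are accepted
    fpr = norm_sf(tau, 0.0)   # incorrect responses that are accepted
    accept = pi * tpr + (1.0 - pi) * fpr
    return pi * tpr / accept, 1.0 / accept


def tau_for_compute(pi, mu, budget, lo=-10.0, hi=10.0):
    """Bisection for the threshold whose expected sample count equals budget."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        _, compute = rs_accuracy_and_compute(pi, mu, mid)
        if compute < budget:
            lo = mid   # threshold too lenient: raise it
        else:
            hi = mid
    return 0.5 * (lo + hi)


def bon_accuracy(pi, mu, n, lo=-10.0, hi=14.0, steps=4000):
    """P(the top-scored of n i.i.d. samples is correct), by trapezoid rule.

    Uses n * integral of pi * f_correct(s) * F_mix(s)^(n-1) ds, where F_mix
    is the score CDF of a single sample drawn from the mixture.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        s = lo + i * h
        mix_cdf = pi * norm_cdf(s, mu) + (1.0 - pi) * norm_cdf(s, 0.0)
        value = pi * norm_pdf(s, mu) * mix_cdf ** (n - 1)
        total += (0.5 if i in (0, steps) else 1.0) * value
    return n * total * h


if __name__ == "__main__":
    pi, mu = 0.3, 1.5  # assumed toy parameters
    for n in (1, 2, 4, 8, 16):
        tau = tau_for_compute(pi, mu, n)
        rs_acc, _ = rs_accuracy_and_compute(pi, mu, tau)
        print(f"N={n:2d}  BoN={bon_accuracy(pi, mu, n):.3f}  RS={rs_acc:.3f}")
```

The equal-variance Gaussian model is chosen because it yields a concave ROC curve, the condition under which Reviewer 03 notes the RS-over-BoN result is stated. Other score distributions can be tried by replacing the three `norm_*` helpers.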

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- One of the main strengths of this paper is that the authors present a simple and interpretable mathematical framework for analyzing how verifier imperfections impact the effectiveness of test-time scaling. Their formulation leverages classical concepts from statistical learning theory (e.g., ROC curves, true/false positive rates), providing a clean lens through which to reason about the verifier. This contribution is valuable because it connects modern LLM sampling practices with well-establis

Weaknesses

- One weakness of this paper is its clarity / organization. I think the clarity can be significantly improved by including a table of notations in the main manuscript. On the first pass, it was difficult to follow the derivations because I had to look back to recall the notation.
- The theory models correctness as a binary variable (either correct or incorrect). This abstraction simplifies analysis, but it may limit applicability in settings where output quality is continuous or graded, e.g., i

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. Interesting problem, clean formalization, and clear scope that focuses on RS and BoN.
2. The result that RS outperforms BoN with fixed average compute seems useful in practice.
3. Empirical experiments show consistent results with theory's prediction.

Weaknesses

1. **Limited evaluation domain**: Can the authors also consider other benchmark categories such as coding or more general QA? This is critical for evaluating whether the conclusion generalizes.
2. **Compute metric mismatch and missing hybrid method**
   - The compute metric for RS is defined as an expectation, while the compute metric for BoN is defined as the deterministic N. The stopping time of RS creates variance, but the analysis optimizes only the mean. Also, the experiments on MATH500 and GS

Reviewer 03 · Rating 8 · Confidence 2

Strengths

This paper presents a solid and well-executed contribution to the LLM research community, even though it does not introduce new algorithms. The theoretical analysis of RS and BoN is thorough and rigorous, providing fresh insights into their behaviors when given *imperfect verifiers*. The work offers practical takeaways for researchers and practitioners considering repeated sampling approaches. *(Disclaimer: Some technical content extends beyond my core expertise, and thus I have not verified

Weaknesses

I don't see major weaknesses in this work, assuming that the theoretical analyses are correct. Some suggestions:

- The takeaway message "RS outperforms BoN under fixed compute" requires the assumption that the ROC curve (for the particular query under consideration) is concave, according to Proposition 5. I think this assumption should be stated more explicitly in the abstract and introduction, otherwise the takeaway message alone could be slightly misleading. In reality, there is no guarant

Taxonomy

Topics: Digital Imaging for Blood Diseases · Anomaly Detection Techniques and Applications