Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality

Arpan Mukherjee; Marcello Bullo; Debabrota Basu; Deniz Gündüz

arXiv:2510.18982·cs.AI·October 23, 2025

Open Access · 3 Reviews

TL;DR

This paper models test-time verification in large language models as an optimal transport problem, revealing three regimes in how sub-optimality varies with generator coverage, supported by empirical results on Qwen, Llama, and Gemma models.

Contribution

The paper introduces a unified transport framework for analyzing the interplay of coverage, verifier ROC, and sub-optimality in test-time scaling, and characterizes three distinct regimes of the sub-optimality--coverage curve.

Findings

01 Identification of three regimes: transport, policy improvement, saturation.

02 Analysis of sequential and batched sampling algorithms and their complexities.

03 Empirical validation with Qwen, Llama, and Gemma models.

Abstract

While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remain underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator's coverage, (ii) the verifier's region of convergence (ROC), and (iii) the sampling algorithm's sub-optimality. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality--coverage curve exhibits three regimes. A transport regime -- where sub-optimality increases with coverage, a policy improvement regime -- where sub-optimality may decrease with coverage,…
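The three quantities named in the abstract can be made concrete with a toy simulation. The sketch below is not the paper's transport formulation or any of its algorithms. It assumes a generator that is correct with probability `p_correct` (a stand-in for coverage), a verifier summarized by a single true-positive rate `tpr` and false-positive rate `fpr` (an assumed model of verifier imperfection), and two hypothetical samplers: one sequential, one batched. All function and parameter names are illustrative.

```python
import random


def sample_response(p_correct, rng):
    """Draw one response from a toy generator; True means the answer is correct."""
    return rng.random() < p_correct


def verifier_accepts(is_correct, tpr, fpr, rng):
    """Toy imperfect verifier: accepts correct answers w.p. tpr, incorrect ones w.p. fpr."""
    return rng.random() < (tpr if is_correct else fpr)


def sequential_rejection(p_correct, tpr, fpr, max_draws, rng):
    """Sample one response at a time and stop at the first accepted one.

    Returns (response, number_of_draws). Assumes max_draws >= 1; if the budget
    runs out, the last draw is returned unverified.
    """
    for n in range(1, max_draws + 1):
        y = sample_response(p_correct, rng)
        if verifier_accepts(y, tpr, fpr, rng):
            return y, n
    return y, max_draws


def batched_selection(p_correct, tpr, fpr, batch_size, rng):
    """Draw a whole batch, then pick uniformly among accepted responses.

    Falls back to a uniformly random response when nothing is accepted.
    Returns (response, number_of_draws).
    """
    batch = [sample_response(p_correct, rng) for _ in range(batch_size)]
    accepted = [y for y in batch if verifier_accepts(y, tpr, fpr, rng)]
    return rng.choice(accepted if accepted else batch), batch_size


def accepted_precision(p_correct, tpr, fpr):
    """Probability that an accepted response is correct, by Bayes' rule."""
    return p_correct * tpr / (p_correct * tpr + (1 - p_correct) * fpr)


if __name__ == "__main__":
    rng = random.Random(0)
    trials = 20000
    hits = sum(sequential_rejection(0.2, 0.9, 0.1, 50, rng)[0] for _ in range(trials))
    print("empirical accuracy:", hits / trials)
    print("Bayes-rule value:  ", accepted_precision(0.2, 0.9, 0.1))
```

In this toy model, with `p_correct=0.2`, `tpr=0.9`, and `fpr=0.1`, the Bayes-rule value is 0.18/0.26, roughly 0.69. Even with unlimited draws, the accuracy of the accepted response is capped by the verifier's error rates, which is one simple way to see why verifier imperfection and coverage have to be analyzed jointly.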

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The optimal transport formulation elegantly unifies the analysis of generator coverage, verifier imperfections, and sampling strategies. This provides a principled way to reason about test-time verification that goes beyond existing asymptotic analyses.
- The paper derives closed-form expressions for sub-optimality (Theorems 3.6, 3.8, 3.10) and computational complexity (Theorems 3.2, 3.5) across multiple algorithms. These exact results are more informative than asymptotic bounds for practical…

Weaknesses

- All experiments use one GSM8K question (sample 2), following Dorner et al. (2025). While this protocol has precedent, it limits our ability to assess whether the three-regime structure generalizes across problems of varying difficulty and characteristics. Even testing on 5-10 representative problems would strengthen the claims.
- SRS and SMC require knowing s_ver (the reference policy's mass on the verifier set), which may not be available at test time. Section 4 treats it as a hyperparameter…

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* The theoretical decomposition of sub-optimality into OTC and a verifier-driven factor is clear; the three-regime picture is intuitive and well explained.
* Precise AiC analysis with an explicit coverage violation condition and sub-optimality formulas.
* SMC construction via maximal coupling, with complexity matching SRS; neatly links transport optimality with compute.
* BRS exponential improvement with a tight envelope; theoretically beats BoN in low/intermediate regimes.
* Empirical studies match theoret…

Weaknesses

* GSM-8K is a very limited dataset, and especially for a test-time compute paper, I strongly recommend using a reasoning benchmark. You can use one or more of the AIME, OlympiadBench, MATH, GPQA, or AMC questions to test the validity of the approaches here.
* An analysis of hybrid approaches between sequential and batched protocols, such as beam search (also called block-wise Best of N), is required.
* Actionable insights are not clear. How can the theoretical bounds and observations here…

Reviewer 03 · Rating 8 · Confidence 1

Strengths

See summary.

Weaknesses

See summary.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms