Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality
Arpan Mukherjee, Marcello Bullo, Debabrota Basu, Deniz Gündüz

TL;DR
This paper models test-time verification in large language models as an optimal transport problem, showing that the sub-optimality--coverage curve has three regimes, and supports the analysis with empirical results across multiple models.
Contribution
It introduces a unified transport framework to analyze the interplay of coverage, verifier ROC, and sub-optimality in test-time scaling, and characterizes three distinct regimes.
Findings
Identification of three regimes: transport, policy improvement, saturation.
Analysis of sequential and batched sampling algorithms and their complexities.
Empirical validation with Qwen, Llama, and Gemma models.
Abstract
While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remain underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator's coverage, (ii) the verifier's region of convergence (ROC), and (iii) the sampling algorithm's sub-optimality. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality--coverage curve exhibits three regimes. A transport regime -- where sub-optimality increases with coverage, a policy improvement regime -- where sub-optimality may decrease with coverage,…
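To make the interaction between coverage and verifier imperfection concrete, here is a minimal toy sketch. It is not the paper's transport formulation or any of its algorithms: it assumes a generator that is correct with a fixed probability `p_correct` and a verifier described by a single ROC operating point (`tpr`, `fpr`). All names and parameters are hypothetical. The sampler draws responses one at a time and returns the first one the verifier accepts, falling back to the last draw if the budget runs out.

```python
import random


def accepted_precision(p_correct, tpr, fpr):
    """Closed-form probability that an accepted sample is correct (toy model)."""
    accept = p_correct * tpr + (1.0 - p_correct) * fpr
    if accept == 0.0:
        raise ValueError("verifier never accepts under these parameters")
    return p_correct * tpr / accept


def simulate_rejection(p_correct, tpr, fpr, budget=50, trials=20000, seed=0):
    """Simulate sequential sampling with an imperfect verifier.

    Returns (accuracy of the returned sample, average number of draws).
    Assumption: if no draw is accepted within `budget`, the last draw is returned.
    """
    rng = random.Random(seed)
    hits = 0
    draws = 0
    for _trial in range(trials):
        correct = False
        for _step in range(budget):
            draws += 1
            # Generator: correct with probability p_correct.
            correct = rng.random() < p_correct
            # Verifier: accepts correct samples w.p. tpr, incorrect ones w.p. fpr.
            accept_prob = tpr if correct else fpr
            if rng.random() < accept_prob:
                break
        hits += correct
    return hits / trials, draws / trials


if __name__ == "__main__":
    acc, avg_draws = simulate_rejection(p_correct=0.2, tpr=0.9, fpr=0.1)
    print("simulated accuracy:", round(acc, 3))
    print("closed form:", round(accepted_precision(0.2, 0.9, 0.1), 3))
    print("average draws:", round(avg_draws, 2))
```

In this toy model the returned sample is correct with probability `p_correct * tpr / (p_correct * tpr + (1 - p_correct) * fpr)`, so a nonzero false-positive rate caps accuracy no matter how large the sampling budget is, while the expected number of draws is roughly the inverse of the acceptance probability.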
Peer Reviews
Decision·ICLR 2026 Poster
- The optimal transport formulation elegantly unifies the analysis of generator coverage, verifier imperfections, and sampling strategies. This provides a principled way to reason about test-time verification that goes beyond existing asymptotic analyses.
- The paper derives closed-form expressions for sub-optimality (Theorems 3.6, 3.8, 3.10) and computational complexity (Theorems 3.2, 3.5) across multiple algorithms. These exact results are more informative than asymptotic bounds for practical
- All experiments use one GSM8K question (sample 2), following Dorner et al. (2025). While this protocol has precedent, it limits our ability to assess whether the three-regime structure generalizes across problems of varying difficulty and characteristics. Even testing on 5-10 representative problems would strengthen the claims.
- SRS and SMC require knowing s_ver (the reference policy's mass on the verifier set), which may not be available at test time. Section 4 treats it as a hyperparameter
* Theoretical decomposition of sub-optimality into OTC and a verifier-driven factor is clear; the three-regime picture is intuitive and well explained.
* Precise AiC analysis with explicit coverage violation condition and sub-optimality formulas.
* SMC construction via maximal coupling with matching complexity to SRS; neatly links transport optimality with compute.
* BRS exponential improvement with tight envelope; theoretically beats BoN in low/intermediate regimes.
* Empirical studies match theoret
* GSM-8K is a very limited dataset, and especially for a test-time compute paper, I strongly recommend using a reasoning benchmark. You can use one or more of the AIME, OlympiadBench, MATH, GPQA, or AMC questions to test the validity of the approaches here.
* An analysis of hybrid approaches between sequential and batched protocols, such as beam search (also called block-wise Best of N), is required.
* Actionable insights are not clear. How can the theoretical bounds and observations here
See summary.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
