Parallel Test-Time Scaling with Multi-Sequence Verifiers
Yegon Kim, Seungyoo Lee, Chaeyun Jang, Hyungi Lee, Juho Lee

TL;DR
This paper introduces the Multi-Sequence Verifier (MSV), a novel joint-processing verifier for parallel test-time scaling in large language models, improving answer selection and reducing latency through early-stopping strategies.
Contribution
The paper proposes MSV, the first verifier to jointly process multiple candidates, enhancing calibration and enabling efficient early-stopping in parallel decoding scenarios.
Findings
MSV improves answer selection accuracy.
MSV reduces latency by approximately 50% while maintaining accuracy.
Joint processing of candidates outperforms isolated scoring methods.
Abstract
Parallel test-time scaling, which generates multiple candidate solutions for a single problem, is a powerful technique for improving large language model performance. However, it is hindered by two key bottlenecks: accurately selecting the correct solution from the candidate pool, and the high inference latency from generating many full solutions. We argue that both challenges are fundamentally linked to verifier calibration. A well-calibrated verifier not only improves answer selection, but also enables early-stopping strategies to reduce latency. However, existing verifiers are limited as they score each candidate in isolation, overlooking rich contextual information across the set of candidates. To address this, we introduce the Multi-Sequence Verifier (MSV), the first verifier designed to jointly process all candidate solutions and model their interactions. MSV achieves improved…
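The two uses of a verifier named in the abstract, answer selection and early stopping, can be illustrated with a minimal sketch. This is not the MSV architecture or the authors' procedure: the scores are assumed to come from some verifier that returns a correctness probability per candidate. The selection rule shown is plain verifier-weighted voting, and the function names, the snapshot format, and the 0.9 threshold are hypothetical choices made for illustration.

```python
from collections import defaultdict


def select_answer(answers, scores):
    """Verifier-weighted voting: sum the scores of candidates that share
    an answer and return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    best = max(totals, key=totals.get)
    return best, totals[best]


def early_stop(snapshots, threshold=0.9):
    """Scan (step, answers, scores) snapshots of partially decoded
    candidates and stop at the first one whose top score reaches the
    threshold. Returns (step, answer, stopped_early), or None if the
    input is empty."""
    result = None
    for step, answers, scores in snapshots:
        top = max(range(len(scores)), key=lambda k: scores[k])
        result = (step, answers[top], scores[top] >= threshold)
        if result[2]:
            break
    return result


# Hypothetical data: four finished candidates with verifier scores.
final_answers = ["42", "41", "42", "7"]
final_scores = [0.6, 0.7, 0.5, 0.1]
chosen, weight = select_answer(final_answers, final_scores)

# Hypothetical data: scores observed after 100, 200, and 300 decoded tokens.
stream = [
    (100, ["3", "42"], [0.40, 0.55]),
    (200, ["42", "42"], [0.95, 0.70]),
    (300, ["42", "42"], [0.97, 0.90]),
]
decision = early_stop(stream, threshold=0.9)
```

Both functions depend on how trustworthy the scores are. A fixed stopping threshold only corresponds to a real confidence level if the verifier is calibrated, which is the link between calibration and latency that the abstract argues for.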
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper tackles an important problem: existing verifiers lack accurate, calibrated, and fast inference. To mitigate this, the paper proposes multi-mask training in the final-answer and streaming-answer scenarios. 2. Ultimately, the proposed method provides decent performance improvements across diverse evaluation benchmarks. Further, it seems to work better than pertinent baselines such as MSV_1 and Probe. 3. The paper also shows that the MSV achieves better calibration
1. The experiments are performed with just one model size and model family, i.e., DeepSeek-R1-Distill-Qwen-1.5B. It would be better to try the method on more models and at various sizes. 2. Having many attention masks that operate on similar sequences feels like overkill. If full attention is enabled (every sequence attends to every other sequence), it remains unclear why the other attention masks are needed in practice. There is no ablation which shows that each attention mas
Overall, the work makes three key contributions: (1) a novel design of MSV as the multi-sequence joint verifier, (2) the demonstration that superior verifier calibration directly improves parallel scaling performance, (3) the introduction of a practical, low-latency parallel early-stopping framework enabled by streaming MSV, which fundamentally rethinks how test-time compute can be scaled without proportional latency costs.
Overall, the paper presents a moderately novel approach but falls short of a truly significant conceptual leap. The work proposes two main contributions: (i) an improved verifier for best-of-N selection, and (ii) a streaming early-stopping framework. (1) For the first, the core innovation lies in explicitly incorporating the proportion of sequences that produce symbolically equivalent answers (i.e., consensus frequency) as an auxiliary feature, while still relying on standard attention mechanis
### Writing
The paper is very well written and well structured. All concepts are cleanly defined and explained (except a unifying overview of the method, see "Weaknesses" below).
### Contribution
The proposed method seems quite novel and interesting. It features various novel ideas (see section 3).
### Results
The results look promising. Not only does MSV seem to lead to accuracy gains, but it also seems to lead to well-calibrated models, which is possibly independently relevant to the res

### Calibration
From a theoretical point of view, it's not clear (at least to me) that the proposed procedure of computing $\tilde{y}$ (which I'm assuming is used as the predictive probability $p$ in section 3.3) actually enforces calibration (even if the experiments show this happens empirically). Yet the authors make it seem like this naturally follows from the proposed training objective (e.g. in line 299). If this is indeed supported by theory, the authors should expand on it.
### Relation
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
