OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen

TL;DR
The paper introduces OPV, an outcome-based process verifier that efficiently verifies long chain-of-thought reasoning in large language models, achieving state-of-the-art results with fewer annotations.
Contribution
It proposes a novel OPV framework combining outcome and process verification, enhanced by active learning and rejection fine-tuning for scalable, accurate reasoning verification.
Findings
OPV achieves an F1 score of 83.1 on OPV-Bench, surpassing larger models.
OPV detects false positives effectively, aligning closely with expert assessments.
Using OPV improves policy model performance, increasing accuracy from 55.2% to 73.3%.
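For readers unfamiliar with the headline metric, the sketch below computes a standard binary F1 score for a verifier's verdicts. The source does not say how F1 is defined on OPV-Bench (for example, which class counts as positive), so the labels and the function name `f1_score` are illustrative assumptions, not the paper's evaluation code.

```python
def f1_score(predicted, actual):
    """Standard binary F1: harmonic mean of precision and recall.

    `predicted` and `actual` are equal-length sequences of booleans.
    Assumption: True is the positive class (here, "solution judged correct").
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical verdicts for five solutions (not data from the paper).
verifier_says_correct = [True, True, False, True, False]
expert_says_correct = [True, False, False, True, True]
score = f1_score(verifier_says_correct, expert_says_correct)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3.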
Abstract
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an…
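The summarize-then-verify idea in the abstract can be sketched as a two-stage function. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `first_flawed_step`, `toy_summarize`, and `toy_step_is_valid` are hypothetical, the toy stand-ins replace what would be LLM calls, and returning the index of the first flawed step is an assumed output format that the source does not confirm.

```python
from typing import Callable, List, Optional


def first_flawed_step(
    cot: str,
    summarize: Callable[[str], List[str]],
    step_is_valid: Callable[[List[str], int], bool],
) -> Optional[int]:
    """Summarize a long CoT, then check the summarized steps in order.

    Returns the index of the first step judged invalid, or None if all pass.
    (Assumed output format; hypothetical interface.)
    """
    steps = summarize(cot)
    for i in range(len(steps)):
        if not step_is_valid(steps, i):
            return i
    return None


def toy_summarize(cot: str) -> List[str]:
    # Toy stand-in for a summarizer model: keep only lines marked "=>",
    # dropping exploratory text and self-corrections.
    return [ln.strip() for ln in cot.splitlines() if ln.strip().startswith("=>")]


def toy_step_is_valid(steps: List[str], i: int) -> bool:
    # Toy stand-in for a verifier model: check claims like "=> 2 + 3 = 5".
    lhs, rhs = steps[i][2:].split("=")
    return sum(int(term) for term in lhs.split("+")) == int(rhs)


cot = """Let me think... maybe try 2 + 2? wait, no.
=> 2 + 3 = 5
Hmm, double-check that.
=> 5 + 4 = 10
=> 10 + 1 = 11
"""
flawed = first_flawed_step(cot, toy_summarize, toy_step_is_valid)
```

Here the second summarized step (index 1) is flagged because 5 + 4 is not 10. Passing the summarizer and verifier as callables keeps the two stages separable, which mirrors the factoring into a summarizer and a verifier that one of the reviews below discusses.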
Peer Reviews
Decision·Submitted to ICLR 2026
1. Clear Problem Motivation: The paper identifies real limitations of OV (ignores process) and PV (expensive for long CoTs), motivating the need for alternative approaches.
2. Substantial Data Collection Effort: Curating 40k expert-annotated solutions with a rigorous three-expert consensus protocol represents significant engineering work.
### 1. Unsubstantiated Efficiency Claims
Total cost = Summarization (671B DeepSeek-V3) + Verification (32B OPV)
The paper provides ZERO computational cost measurements: no latency, FLOPs, or token counts. It claims "efficient" repeatedly but is likely more expensive than vanilla PV. This undermines the entire motivation.
### 2. Unfair Experimental Comparisons
- Baselines: Qwen3-Max, R1, etc. use zero-shot prompting
- OPV: Uses 40k expert annotations + iterative training + RL
This compares "supervis
1. Reasonable methodological approach
As the complexity of problems increases, the difficulty of verifying thought processes (CoTs) rises as well. Therefore, linearization and simplification of long CoTs can make the verification problem much smaller and easier, and the proposed summarization approach sounds like a fair approach in this line.
2. Human-in-the-loop pipeline for practical applicability
Taking advantage of the simplified verification process thanks to the summarization, the au
1. Dependency of OPV on summarization accuracy
Viewing the originally generated CoTs (+ the final answers) as inputs, this work ultimately proposes to factor a process-based verifier into two components: (a) the summarizer and (b) the summarized process-based verifier. However, this work primarily focuses on (b). While the authors stress the importance of correct summarization and mention re-summarization, the discussion remains at a qualitative level. Importantly, it appears that the summarized CoTs ar
+ Existing PRM approaches use Monte Carlo rollouts, which are noisy. Unlike previous automated rollouts, this paper uses humans in the loop to improve the data and model quality iteratively.
+ The authors curate and iterate to create the OPV-Bench dataset, which could be valuable for process verification methods.
- Missing model comparison. This work is positioned in between the ORM and PRM, but does not provide direct comparisons with existing methods (e.g., Qwen PRM or other ORM baselines). For the benchmark it would be important to understand: 1. How do the other models compare on the process bench? 2. How do the other models compare on OPV-Bench?
- Missing dataset construction details. Authors only mentioned problem curation as K-12 education, high-school competitions, and undergraduate-level m
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Multimodal Machine Learning Applications
