Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier
Jianyuan Zhong, Zeju Li, Zhijian Xu, Xiangyu Wen, Kezhi Li, Qiang Xu

TL;DR
This paper introduces FlexiVe, a flexible generative verifier, and the Solve-Detect-Verify pipeline to improve reasoning accuracy and efficiency of large language models during inference by intelligently balancing verification resources.
Contribution
The paper presents a novel inference-time scaling framework with FlexiVe and the Solve-Detect-Verify pipeline, enabling efficient and accurate LLM reasoning through adaptive verification strategies.
Findings
FlexiVe achieves superior error detection accuracy on ProcessBench.
The full approach outperforms baselines on AIME and CNMO benchmarks.
The approach improves both reasoning accuracy and inference efficiency.
Abstract
Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying…
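The fast/slow trade-off can be pictured with a small sketch. According to the reviews below, FlexiVe samples k verdicts in no-thinking mode and uses their agreement ratio to decide whether to keep the fast verdict or escalate to the thinking mode. The code below is only an illustration of that idea under stated assumptions: the function name `flexible_verify`, the callables `fast_verify` and `slow_verify`, the default `k`, the `threshold` value, and the encoding of a verdict as a first-error step index (with -1 for "no error found") are all hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple


def flexible_verify(
    fast_verify: Callable[[], int],
    slow_verify: Callable[[], int],
    k: int = 8,
    threshold: float = 0.75,
) -> Tuple[int, str]:
    """Return (verdict, mode_used).

    Assumption: a verdict is the index of the first erroneous step,
    or -1 when no error is found. k and threshold are illustrative.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    # Cheap pass: sample k verdicts without extended reasoning.
    votes: List[int] = [fast_verify() for _ in range(k)]

    # Agreement ratio = share of samples that match the most common verdict.
    verdict, count = Counter(votes).most_common(1)[0]
    agreement = count / k

    # High consensus is treated as a straightforward case: keep the fast verdict.
    if agreement >= threshold:
        return verdict, "fast"

    # Low consensus: spend the larger budget on a single slow verification.
    return slow_verify(), "slow"
```

In this sketch the slow verifier is called at most once per solution, so the extra cost is paid only when the cheap samples disagree. A reviewer's concern that high consensus may not always indicate an easy case would correspond here to all k fast samples agreeing on a wrong verdict, which this rule cannot detect.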
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Reasonable high-level approach: It can be a reasonable high-level approach to first perform an efficient inference and then conditionally escalate it to a more computationally intensive inference, depending on the difficulty of the problem. With this theme, I think the proposed approach conceptually makes sense.
2. Comprehensive information and reproducibility: The authors provide most of the details of their approach. They present not only various statistics from the experimental results
1. Scalability of the Flex mode: For FlexiVe, the authors use the agreement ratio among k answers generated in the no-thinking mode as a consensus signal to determine whether to proceed with the original thinking mode or stick to the answer generated with the no-thinking mode. They describe that a high consensus "signals a straightforward case" (L216). While I agree that this could roughly hold, I have a concern that it may not scale well. Specifically, using a high consensus as a condition to not p
1. The paper proposes a new verifier that can flexibly switch between thinking fast and slow, and an approach to decide when to think longer based on the difficulty of the problem. This improves both performance and efficiency.
2. The paper also contributes an approach to incorporate the verifier in the overall pipeline. Prior works typically use the verifier to select the best out of N solutions (known as best-of-N). They propose a new pipeline called Solve-Detect-Verify, which iteratively refi
The idea of iterative refinement is not entirely new (for example, [1]), which affects the novelty of the paper. The writing is confusing in some places:
1. Lines 165 and 167: this could be elaborated. Maybe talk about the architectures considered in the paper, and also explain what a process-based reward model is (maybe contrast it with an outcome reward model).
2. The paper states “fast thinking” is the same as “no think”, but in “no think”, there should be no reasoning trace and the model should directly
1. The proposed method shows a certain level of novelty, especially on the engineering side, for example in the fast/slow thinking parts.
2. Empirical evaluation shows a consistent improvement over existing models. The efficiency analysis is also valuable.
1. The proposed method mostly relies on hand-crafted hacks, which **lack a clear justification/ablation**, and contributes little on the machine learning side.
- For example, it remains unclear to me why you need to predict the first error step index in an auto-regressive way. As the simplest approach, you can just train a scalar model to output the error probability on the end token of each step.
- Second, the completion assessment and hesitation words are more like **hacking of DeepSe
+ comparison against GENPRM, majority voting, and thinking models without the reward model.
+ quantification of an important problem with current reasoning models: overthinking (models spend a significant amount of time searching for additional solutions and potential mistakes)
+ advancing the frontier of generative PRMs
+ analysis of statistical significance
+ quantification of RL vs supervised tuning
+ minor: as noted by the authors, the current approach to overthinking detection can fail to generalize, as it is mostly based on keyword detection
+ minor (as I understand it is not the case for AIME) benchmark problem: F1 score can be noisy
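The keyword-based overthinking detection that the reviews above refer to can be pictured with a short sketch. This is an assumption-laden illustration only: the word list `HESITATION_WORDS`, the function name `first_hesitation_step`, and the choice to scan step by step are hypothetical and not taken from the paper.

```python
import re
from typing import Iterable, Optional

# Hypothetical hesitation markers; the paper's actual list is not given here.
HESITATION_WORDS = ("wait", "hmm", "alternatively")

# Match whole words only, case-insensitively, so "awaiting" does not trigger.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, HESITATION_WORDS)) + r")\b",
    re.IGNORECASE,
)


def first_hesitation_step(steps: Iterable[str]) -> Optional[int]:
    """Return the index of the first step containing a hesitation word, else None."""
    for index, step in enumerate(steps):
        if _PATTERN.search(step):
            return index
    return None
```

A detector of this kind is cheap to run during generation, but it depends on the surface vocabulary of a particular model, which is the generalization risk the reviewers point out.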
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education
