Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier

Jianyuan Zhong; Zeju Li; Zhijian Xu; Xiangyu Wen; Kezhi Li; Qiang Xu

arXiv:2505.11966 · cs.AI · May 20, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces FlexiVe, a flexible generative verifier, and the Solve-Detect-Verify pipeline to improve reasoning accuracy and efficiency of large language models during inference by intelligently balancing verification resources.

Contribution

The paper presents a novel inference-time scaling framework with FlexiVe and the Solve-Detect-Verify pipeline, enabling efficient and accurate LLM reasoning through adaptive verification strategies.

Findings

01. FlexiVe achieves superior error detection accuracy on ProcessBench.

02. The full approach outperforms baselines on AIME and CNMO benchmarks.

03. Enhanced reasoning accuracy and inference efficiency are demonstrated.

Abstract

Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. Reasonable high-level approach. It can be a reasonable high-level approach to first perform an efficient inference and then conditionally escalate it to a more computationally intensive inference, depending on the difficulty of the problem. With this theme, I think the proposed approach conceptually makes sense.
2. Comprehensive information and reproducibility. The authors provide most of the details of their approach. They present not only various statistics from the experimental results

Weaknesses

1. Scalability of the Flex mode. For FlexiVe, the authors use the agreement ratio among k answers generated in the no-thinking mode as a consensus signal to decide whether to proceed with the original thinking mode or to keep the answer generated in the no-thinking mode. They describe that a high consensus "signals a straightforward case" (L216). While I agree that this could roughly hold, I have a concern that it may not scale well. Specifically, using a high consensus as a condition to not p
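
As a rough illustration of the consensus rule this reviewer describes, the following Python sketch computes the agreement ratio over k fast (no-thinking) verdicts and escalates to the slow (thinking) mode only when agreement is low. It is not the authors' implementation: the names `agreement_ratio` and `flexible_verify`, the 0.8 threshold, and the encoding of a verdict as the index of the first erroneous step (with -1 meaning no error found) are all assumptions made for this example.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def agreement_ratio(verdicts: Sequence[int]) -> float:
    """Return the fraction of verdicts equal to the most common verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    _, top_count = Counter(verdicts).most_common(1)[0]
    return top_count / len(verdicts)


def flexible_verify(
    fast_verdicts: Sequence[int],
    slow_verifier: Callable[[], int],
    threshold: float = 0.8,  # hypothetical value, not taken from the paper
) -> Tuple[int, str]:
    """Accept the majority fast verdict on high consensus, else escalate.

    fast_verdicts: k verdicts sampled in the no-thinking mode (assumed to be
        first-error step indices, -1 for "no error").
    slow_verifier: callable that runs the costlier thinking-mode check.
    """
    if agreement_ratio(fast_verdicts) >= threshold:
        majority = Counter(fast_verdicts).most_common(1)[0][0]
        return majority, "fast"
    # Low consensus: spend the extra budget on slow thinking.
    return slow_verifier(), "slow"
```

In this sketch a unanimous set of fast verdicts is accepted without ever calling the slow verifier, whether or not those verdicts are correct, so the choice of threshold and of k determines how often the slower check runs.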

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper proposes a new verifier that can flexibly switch between thinking fast and slow, and an approach to decide when to think longer based on the difficulty of the problem. This improves both performance and efficiency.
2. The paper also contributes an approach to incorporate the verifier in the overall pipeline. Prior works typically use the verifier to select the best out of N solutions (known as best-of-N). They propose a new pipeline called Solve-Detect-Verify, which iteratively refi

Weaknesses

The idea of iterative refinement is not entirely new (for example, [1]), which affects the novelty of the paper. The writing is confusing in some places:
1. Lines 165 and 167: this could be elaborated. Maybe talk about the architectures considered in the paper, and also explain what a process-based reward model is (maybe contrast with an outcome reward model).
2. The paper states “fast thinking” is the same as “no think”, but in “no think”, there should be no reasoning trace and the model should directly

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The proposed method shows a certain level of novelty, especially on the engineering side. For example, the fast/slow thinking parts.
2. Empirical evaluation shows a consistent improvement against existing models. The efficiency analysis is also valuable.

Weaknesses

1. The proposed method mostly relies on hand-crafted hacks, which **lack a clear justification/ablation** and contribute little on the machine learning side.
   - For example, it remains unclear to me why you need to predict the first error step index in an auto-regressive way. As the simplest approach, you can just train a scalar model to output the error probability on the end token of each step.
   - Second, the completion assessment and hesitation words are more like **hacking of DeepSe

Reviewer 04 · Rating 8 · Confidence 3

Strengths

+ comparison against GENPRM, majority voting, and thinking models without the reward model
+ quantification of the important problem with current reasoning models: overthinking (models search for additional solutions, and potential mistakes, for a significant amount of time)
+ advancing the frontier of generative PRMs
+ analysis of statistical significance
+ quantification of RL vs supervised tuning

Weaknesses

+ minor: as noted by the authors, the current approach to overthinking detection can fail to generalize, as it is mostly based on keyword detection
+ minor (as I understand it is not the case for AIME) benchmark problem: F1 score can be noisy

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education