TL;DR
StepWiser is a generative judge that reasons about a policy model's intermediate reasoning steps before delivering a verdict. Trained with reinforcement learning, it improves step-level judgment accuracy and gives better feedback for policy training.
Contribution
It reframes stepwise reward modeling as a reasoning task, enabling a generative judge to provide explanations and enhance model training and inference.
Findings
Outperforms existing methods in judgment accuracy on intermediate steps
Enhances policy model training with better feedback
Improves inference-time search efficiency
Abstract
As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show it provides (i) better…
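The abstract says the judge is trained by reinforcement learning "using relative outcomes of rollouts" but gives no further detail here. The sketch below shows one way such a signal could be built. It assumes each step is scored by the Monte Carlo success rate of rollouts continued from it, and that a step counts as good when that rate does not drop relative to the preceding prefix. The names `estimate_q`, `label_steps`, `verdict_reward`, and the `margin` threshold are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a rollout continues from a prefix of steps and
# returns True if the final answer it reaches is correct.
Rollout = Callable[[Sequence[str]], bool]


def estimate_q(prefix: Sequence[str], rollout: Rollout, n: int = 8) -> float:
    """Monte Carlo estimate: fraction of n rollouts from `prefix` that succeed."""
    return sum(1 for _ in range(n) if rollout(prefix)) / n


def label_steps(steps: Sequence[str], rollout: Rollout,
                n: int = 8, margin: float = 0.0) -> List[int]:
    """Label each step 1 (good) or 0 (bad) by the change in success rate.

    Assumption: a step is good when the estimated success rate after it is
    at least the rate before it (plus `margin`). This is one possible reading
    of "relative outcomes", not the paper's confirmed rule.
    """
    labels = []
    q_prev = estimate_q([], rollout, n)  # success rate before any step
    for i in range(len(steps)):
        q_curr = estimate_q(steps[: i + 1], rollout, n)
        labels.append(1 if q_curr - q_prev >= margin else 0)
        q_prev = q_curr
    return labels


def verdict_reward(judge_verdict: int, label: int) -> float:
    """Reward only the judge's final verdict; its thinking tokens are not scored."""
    return 1.0 if judge_verdict == label else 0.0


# Toy usage with a deterministic stand-in for the policy model:
# any prefix containing the step "bad" can no longer reach a correct answer.
def toy_rollout(prefix: Sequence[str]) -> bool:
    return "bad" not in prefix


toy_labels = label_steps(["a", "bad", "c"], toy_rollout, n=4)
print(toy_labels)
```

In this toy run the step after the mistake is labeled good, because the success rate does not fall any further. This is a property of scoring steps by relative change and not by absolute success rate.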
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The idea of reframing stepwise evaluation from classification to reasoning is interesting.
- The paper is well written. Each component of their methods is well explained. Strong ablations clearly isolate the contribution of each component and demonstrate improvements in both training-time reward modelling and inference-time search.
- The results show that their method provides improved inference-time search and better training-time rewards.
- Despite its framing, the approach does not fundamentally innovate beyond concurrent work on generative process reward models. The “reasoning about reasoning” claim is largely conceptual, since the judge’s reasoning trace is not used in the loss or supervision signal. The RL reward depends solely on the final verdict label.
- All results focus on final-answer correctness. There is no manual evaluation of the judge’s faithfulness or calibration. Annotating a small set of reasoning steps and comp
1. The paper is well written, and the method is easy to understand.
2. The experiments show significant performance improvements.
3. The ablation shows that the proposed components are valid.
1. The main experiments are conducted only on Qwen2.5, which is affected by data contamination on several datasets.
2. The method is costly (around 5 days on 8×A100 GPUs), and the paper does not provide fair or cost-comparable baselines, such as RL-trained reasoning GenRMs (cost-competitive reasoning GenRMs without process-level reward).
3. The authors do not show whether the improvement in the model's judging ability is related to data contamination, and whether the model's judging ability strongl
This paper presents a well-motivated and technically sound contribution to the emerging area of process-level reward modeling for reasoning-intensive LLMs. Its central innovation lies in reframing stepwise evaluation as a generative reasoning task, where the judge model is trained to “reason about reasoning.” The proposed STEPWISER framework is conceptually elegant and empirically convincing: it combines (1) self-segmentation of chains-of-thought into coherent “chunks of thought,” (2) Monte-Carl
While the proposed framework is conceptually strong, several limitations reduce its scientific rigor and generality. First, all experiments are conducted exclusively on Qwen2.5-based models (1.5B and 7B). These models are known to have been exposed to extensive mathematical corpora during pretraining, leading to potential data contamination and evaluation leakage on benchmarks such as GSM8K, MATH, and ProcessBench. This raises concerns about the true generalization of the proposed method beyond
1. The paper creatively shifts the paradigm from a simple classification task (discriminative PRMs) to a generative reasoning task, forcing the judge to "meta-reason" by producing a CoT.
2. The authors provide comprehensive ablations that clearly isolate the individual performance gains from using RL over SFT, generative CoT over discriminative formats, and dataset balancing. The practical utility is convincingly shown through superior results in both inference-time search ("Chunk-Reset Reasonin
1. The paper's primary weakness is its incremental novelty. The core components (generative verifiers, MC rollouts, RL training) have been individually explored in recent works. However, the authors provide a comprehensive and systematic integration of this entire pipeline. I will give a positive view of its contribution.
2. Significant Computational and Latency Overheads. The multi-stage pipeline, requiring extensive MC rollouts for data annotation (14 days) followed by a full online RL run (5 d
