TL;DR
VRPRM is a process reward model built on visual reasoning and trained with a two-stage strategy. Using only 3.6K CoT-PRM SFT samples and 50K non-CoT PRM RL samples, it surpasses a non-thinking PRM trained on 400K samples, at lower annotation cost.
Contribution
The paper proposes VRPRM, a novel visual reasoning approach for process reward modeling, along with an efficient two-stage training strategy that improves reasoning performance with less data.
Findings
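With only 3.6K CoT-PRM SFT samples and 50K non-CoT PRM RL samples, VRPRM surpasses a non-thinking PRM trained on 400K samples.
Achieves a relative performance improvement of up to 118% in Best-of-N (BoN) experiments.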
Reduces data annotation costs while enhancing reasoning capabilities.
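To make the BoN finding concrete, the following is a minimal sketch of Best-of-N selection with a process reward model. It is not the paper's implementation: the function names, the toy scorer, and the choice of averaging per-step scores are all assumptions made for illustration, and the summary above does not say how VRPRM aggregates step-level judgments.

```python
from typing import Callable, List, Sequence

Candidate = List[str]  # one candidate solution = an ordered list of reasoning steps


def aggregate_step_scores(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one score (assumption: arithmetic mean)."""
    if not step_scores:
        raise ValueError("a candidate needs at least one scored step")
    return sum(step_scores) / len(step_scores)


def best_of_n(
    candidates: Sequence[Candidate],
    score_steps: Callable[[Candidate], Sequence[float]],
) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = [aggregate_step_scores(score_steps(c)) for c in candidates]
    return max(range(len(candidates)), key=totals.__getitem__)


# Hypothetical stand-in for a PRM: a fixed lookup table of step scores.
STEP_TABLE = {
    "x = 2": 0.2,
    "x = 3": 0.9,
    "so y = 5": 0.5,
    "so y = 7": 0.8,
    "so y = 6": 0.3,
}


def toy_prm(candidate: Candidate) -> List[float]:
    """Score each step by table lookup (illustrative only, not a real model)."""
    return [STEP_TABLE[step] for step in candidate]


candidates = [
    ["x = 2", "so y = 5"],
    ["x = 3", "so y = 7"],
    ["x = 3", "so y = 6"],
]
best_index = best_of_n(candidates, toy_prm)
```

In a real pipeline the policy model would sample the N candidates and the PRM would replace `toy_prm`. Other aggregation rules, such as taking the minimum step score, are also plausible choices.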
Abstract
Process Reward Models (PRMs) are widely used in the post-training of Large Language Models (LLMs) because they can perform fine-grained evaluation of the reasoning steps of generated content. However, most PRMs lack long-term reasoning and deep thinking capabilities. Although a few works have tried to introduce Chain-of-Thought (CoT) capability into PRMs, CoT-PRM data is too expensive to annotate for such models to play a stable role across tasks. To address these challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that, using only 3.6K CoT-PRM Supervised Fine-Tuning (SFT) data and 50K non-CoT PRM Reinforcement Learning (RL) training data, VRPRM can surpass the non-thinking PRM with a total data volume of 400K and achieve a relative performance improvement of up to 118%…
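The two headline numbers in the abstract can be unpacked with a few lines of arithmetic. The sample counts below come from the abstract. The baseline and improved scores are placeholders chosen only to show what a 118% relative improvement means; they are not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement in percent: (improved - baseline) / baseline * 100."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Data budget stated in the abstract.
vrprm_samples = 3_600 + 50_000   # CoT-PRM SFT data + non-CoT PRM RL data
baseline_samples = 400_000       # data volume of the non-thinking PRM
data_fraction = vrprm_samples / baseline_samples

print(f"VRPRM uses {data_fraction:.1%} of the baseline data volume")  # 13.4%

# Placeholder scores (NOT from the paper): a 118% relative improvement
# means the improved score is 2.18 times the baseline score.
example_gain = relative_improvement(baseline=10.0, improved=21.8)
```

A relative improvement above 100% is therefore possible whenever the new score is more than double the baseline, which is easier to reach when the baseline score is small.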
Peer Reviews
Decision·Submitted to ICLR 2026
1. **Effective Multi-stage Training for Data Efficiency:** The core methodology, combining a small, high-quality SFT CoT dataset with a larger, less costly RL dataset, is highly practical. The results strongly suggest that the initial SFT phase effectively primes the model for complex reasoning, allowing the subsequent RL stage to generalize this capability robustly and efficiently, overcoming the high annotation cost typically associated with CoT-PRM data.
2. **Demonstrated Performance Gains:*
1. **Increased Computational Overhead:** The explicit nature of the Process Reward Model, which generates a step-by-step judgment (a reasoning trace of its own) for the solver's output, inevitably introduces significant computational overhead compared to non-reasoning, end-to-end reward models (which only generate a single score). The paper does not provide a quantitative analysis of this overhead (e.g., token generation time or total inference latency) relative to simpler PRM or reward model ba
Clear, practical PRM pipeline: The two-stage training design (CoT-PRM SFT → RL on non-CoT) is well-motivated and carefully specified, with strict format/quality checks.
Strong empirical results on process evaluation: On VisualProcessBench, VRPRM-7B variants outperform prior work (incl. VisualPRM) on both FEI/AEI.
Cross-family generalization not fully established: Most experiments pair VRPRM with InternVL2.5 policies for test-time scaling, with no results on other families (e.g., Qwen-VL or GPT-class).
RL training dynamics are underreported: The paper does not show reward trajectories or response-length curves during RL, making it hard to diagnose the training process.
Higher inference overhead for CoT-PRM: Compared to PRMs that output a single {+/-} token, the CoT-PRM requires structured reasoning and
- **Simple two-stage recipe combining CoT and RL for easy adoption.** The paper first uses SFT on a small, structured CoT-PRM dataset to seed reasoning, then applies RL on larger non-CoT PRM data to scale. CoT and RL are explicitly integrated in a multimodal PRM, and rewards cover both format and process, keeping implementation straightforward.
- **Stronger process supervision with clear empirical gains.** On *VisualProcessBench* it surpasses prior PRMs on process metrics. Ablations show removin
- My main concern is the inconsistency between the experimental settings of Table 1 and Table 2. In Table 1, the authors evaluate VRPRM using Qwen and MiMo backbones to analyze the model’s process reasoning ability, but in Table 2, they switch to the InternVL2.5 family as the policy model for BoN testing without specifying which version of VRPRM serves as the critic, and they omit the MiMo results entirely. This inconsistency makes the two tables difficult to compare and raises questions about t
