Process Reward Model with Q-Value Rankings
Wendi Li, Yixuan Li

TL;DR
This paper introduces the Process Q-value Model (PQM), a new framework for process reward modeling that uses Q-value rankings and a comparative loss function to better capture decision interdependencies, outperforming traditional classification methods.
Contribution
The paper presents PQM, a novel approach that frames process reward modeling as a Markov Decision Process, improving how reward is distributed across steps and how interdependencies among decisions are captured.
Findings
PQM outperforms classification-based PRMs across various benchmarks.
The comparative loss function enhances the model's ability to capture decision dynamics.
Ablation studies confirm the effectiveness of the proposed approach.
Abstract
Process Reward Modeling (PRM) is critical for complex reasoning and decision-making tasks where the accuracy of intermediate steps significantly influences the overall outcome. Existing PRM approaches, primarily framed as classification problems, employ cross-entropy loss to independently evaluate each step's correctness. This method can lead to suboptimal reward distribution and does not adequately address the interdependencies among steps. To address these limitations, we introduce the Process Q-value Model (PQM), a novel framework that redefines PRM in the context of a Markov Decision Process. PQM optimizes Q-value rankings based on a novel comparative loss function, enhancing the model's ability to capture the intricate dynamics among sequential decisions. This approach provides a more granular and theoretically grounded methodology for process rewards. Our extensive empirical…
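To make the contrast with per-step classification concrete, here is a minimal sketch of a ranking-style objective over step-level Q-values. It is a generic pairwise hinge loss, not the comparative loss defined in the paper; the function name `pairwise_margin_ranking_loss`, the `is_correct` labels, and the `margin` parameter are all illustrative assumptions.

```python
from typing import Sequence


def pairwise_margin_ranking_loss(
    q_values: Sequence[float],
    is_correct: Sequence[bool],
    margin: float = 1.0,
) -> float:
    """Generic pairwise hinge ranking loss over step-level Q-values.

    ASSUMPTION: this is an illustrative stand-in, not the paper's loss.
    It penalizes every (correct step, incorrect step) pair in which the
    correct step's Q-value does not exceed the incorrect step's Q-value
    by at least `margin`.
    """
    if len(q_values) != len(is_correct):
        raise ValueError("q_values and is_correct must have the same length")

    correct = [q for q, ok in zip(q_values, is_correct) if ok]
    wrong = [q for q, ok in zip(q_values, is_correct) if not ok]

    # With only one class present there are no pairs to rank.
    if not correct or not wrong:
        return 0.0

    total = sum(
        max(0.0, margin - (q_c - q_w)) for q_c in correct for q_w in wrong
    )
    # Average over all (correct, incorrect) pairs.
    return total / (len(correct) * len(wrong))
```

Unlike a cross-entropy loss applied to each step independently, the loss here depends only on differences between the Q-values of steps in the same trajectory, which is the sense in which a ranking objective couples the steps.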
Peer Reviews
Decision: ICLR 2025 Poster
- The proposed ranking loss for training PRMs empirically improves best-of-N on the MATH500 dataset for multiple base LLMs. In particular, it outperforms the prior work of Wang et al., which trains the PRM with a BCE loss.
- The analysis in Section 4.3 is insightful and shows that using a margin-based ranking (ablating on $\zeta$) improves best-of-N performance when using the PQMs as an ORM (taking the min score over individual steps).
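The min-over-steps aggregation mentioned in this review can be sketched as follows. This is a minimal illustration assuming each candidate solution already has one score per step; the function names `solution_score` and `best_of_n` are hypothetical and not taken from the paper's code.

```python
from typing import Sequence


def solution_score(step_scores: Sequence[float]) -> float:
    """Score a whole solution by its weakest step (min aggregation)."""
    if not step_scores:
        raise ValueError("a solution needs at least one step score")
    return min(step_scores)


def best_of_n(candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the highest min step score.

    ASSUMPTION: `candidates[i]` holds the per-step scores that some
    process reward model assigned to the i-th sampled solution.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))
```

For example, with per-step scores `[[0.9, 0.2, 0.8], [0.6, 0.7, 0.65], [0.95, 0.9, 0.1]]`, the minimum scores are 0.2, 0.6, and 0.1, so the second candidate (index 1) is selected even though the others contain higher-scoring individual steps.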
- The trained verifier is only used for best-of-N, which is not its most promising use case. Evaluating its efficacy for beam search, where the PQM ranks intermediate generations, is needed to demonstrate why practitioners should train PQMs.
- Training with the binary cross-entropy loss, where the labels for each prefix are the expected future reward (some value between 0 and 1), will also distinguish prefixes across problems, and maybe the difference in expected rewards for prefixes can be accent
Overall, the paper is well written (see one remark in Weaknesses) and easy to follow. The paper extends the existing PRM framework to a more general PQM framework that uses Q-values instead of intermediate rewards, which makes it possible to capture the dependency between reasoning states (rather than treating these states as independent). This extension is natural and well motivated. The paper provides an empirical study highlighting the effectiveness of the proposed method. Moreover, the paper p
A suggestion to slightly improve the presentation: in Section 3.3, it would be helpful to outline the overall objective of Theorem 3.5 (what we want to prove and why), and then outline the plan for the proof (why the other lemmas are needed). From the results (Figure 2), it is unclear why the gap in performance between SC+PQM and PQM grows toward the right. It would be helpful if the authors explained why they believe this happens.
- Sections 2 and 3 are well written and provide sufficient context on the field as well as the proposed method.
- The experiments are comprehensive and, without a doubt, the proposed method achieves high performance across multiple LLM backends and math benchmarks, making it a strong contribution.
- Section 3 could benefit from some intuition. While the content seems sound, some higher-level guidance could be beneficial (this is just a minor issue).
- Experiments on self-consistency could benefit from some additional explanation. In particular, given Figure 2 (right), why is self-consistency particularly important for larger models, given the clear difference between PQM and PQM+SC only in the 70B model? Providing additional insights here would be very useful.
- I would recommend adding
Taxonomy
Topics: Big Data and Business Intelligence
