Discriminative Policy Optimization for Token-Level Reward Models

Hongzhan Chen; Tao Yang; Shiping Gao; Ruijun Chen; Xiaojun Quan; Hongtao Tian; Ting Yao

arXiv:2505.23363 · cs.CL · May 30, 2025

TL;DR

This paper introduces the Q-function Reward Model (Q-RM), a token-level reward model obtained by optimizing a discriminative policy rather than a generative one. Decoupling reward modeling from language generation is reported to give more stable training and better reasoning performance.

Contribution

The paper proposes a Q-function-based reward model that decouples reward assignment from language generation, with the aim of making token-level credit assignment more stable and more accurate.

Findings

01. Q-RM outperforms baseline methods on various benchmarks.

02. Training with Q-RM converges significantly faster than training with ORM or PRM.

03. Q-RM improves reasoning task scores by several points.

Abstract

Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM…
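
To make the phrase "reward scores derived from token generation probabilities" concrete, the sketch below shows one common form used by earlier generative approaches: a per-token reward computed as a scaled log-probability ratio between a trained policy and a reference model. This illustrates the prior setup the abstract contrasts against, not Q-RM itself. The function name `token_level_rewards`, the scale `beta`, and the example probabilities are all assumptions made for illustration and do not come from the paper.

```python
import math

def token_level_rewards(policy_logprobs, ref_logprobs, beta=0.1):
    """Hypothetical helper: per-token reward as beta * (log pi - log pi_ref).

    Both inputs are log-probabilities of the same generated tokens,
    one value per token, under the policy and the reference model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    return [beta * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]

# Made-up probabilities for a three-token response (illustrative only).
policy_lp = [math.log(0.5), math.log(0.2), math.log(0.9)]
ref_lp = [math.log(0.25), math.log(0.4), math.log(0.9)]

rewards = token_level_rewards(policy_lp, ref_lp, beta=1.0)
# Tokens the policy prefers more than the reference get positive reward,
# tokens it prefers less get negative reward, and ties get zero.
print(rewards)
```

Because the reward here is read directly off the generator's token probabilities, reward modeling and language modeling share the same parameters, which is the coupling the abstract identifies as a possible source of instability.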

Code & Models

Repositories

homzer/q-rm (PyTorch, official)

Taxonomy

Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare