Pre-Trained Policy Discriminators are General Reward Models

Shihan Dou; Shichun Liu; Yuming Yang; Yicheng Zou; Yunhua Zhou; Shuhao Xing; Chenhao Huang; Qiming Ge; Demin Song; Haijun Lv; Songyang Gao; Chengqi Lv; Enyu Zhou; Honglin Guo; Zhiheng Xi; Wenwei Zhang; Qipeng Guo; Qi Zhang; Xipeng Qiu; Xuanjing Huang; Tao Gui; Kai Chen

arXiv:2507.05197 · cs.CL · January 21, 2026

TL;DR

This paper reframes reward modeling as policy discrimination and uses pre-training to build scalable, high-performing reward models that generalize well across tasks and improve policies trained with reinforcement learning.

Contribution

The paper proposes Policy Discriminative Learning (POLAR), a pre-training method for reward models based on policy discrimination. POLAR outperforms traditional reward modeling methods and demonstrates strong generalization and scaling properties.

Findings

01 POLAR-7B improves preference accuracy significantly.

02 POLAR enhances RLHF policy performance across benchmarks.

03 Reward models exhibit a power-law scaling relationship.
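
As a rough illustration of what a power-law scaling relationship means in practice, the sketch below fits y = a * x^b by ordinary least squares in log-log space and reports the linear correlation coefficient. The function name `fit_power_law`, the compute values, and the loss values are all hypothetical and synthetic; none of the numbers come from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    syy = sum((v - my) ** 2 for v in ly)
    b = sxy / sxx                   # exponent = slope of the log-log line
    a = math.exp(my - b * mx)       # prefactor = exp(intercept)
    r = sxy / math.sqrt(sxx * syy)  # linear correlation in log-log space
    return a, b, r

# Synthetic data (NOT from the paper): loss = 2.0 * compute ** -0.05
compute = [1e18, 1e19, 1e20, 1e21]
loss = [2.0 * c ** -0.05 for c in compute]

a, b, r = fit_power_law(compute, loss)
print(f"a={a:.3f}, b={b:.3f}, r={r:.3f}")
```

A correlation coefficient with magnitude close to 1 in log-log space is the usual sign that a power law describes the data well.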

Abstract

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR…
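
The policy-discrimination idea in the abstract can be sketched as a pairwise ranking objective: given a reference output from a target policy, a candidate from the same policy should score higher than a candidate from a different policy. The code below is a minimal sketch under stated assumptions, not the authors' implementation. The Bradley-Terry style loss is an assumed choice of objective, `toy_reward` is a hypothetical token-overlap stand-in for a learned reward model, and the example strings are invented.

```python
import math

def bradley_terry_loss(score_same: float, score_diff: float) -> float:
    """-log(sigmoid(score_same - score_diff)), in a numerically stable form."""
    margin = score_same - score_diff
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def toy_reward(reference: str, candidate: str) -> float:
    """Hypothetical stand-in for the RM: Jaccard overlap of whitespace tokens."""
    ref, cand = set(reference.split()), set(candidate.split())
    return len(ref & cand) / max(len(ref | cand), 1)

# Invented example: one reference from the target policy and two candidates.
reference = "the answer is 42 because 6 times 7 equals 42"
same_policy = "the answer is 42 since 6 times 7 equals 42"
other_policy = "i am not sure but maybe 40"

r_same = toy_reward(reference, same_policy)    # candidate from the same policy
r_diff = toy_reward(reference, other_policy)   # candidate from a different policy
loss = bradley_terry_loss(r_same, r_diff)
print(f"r_same={r_same:.2f}, r_diff={r_diff:.2f}, loss={loss:.3f}")
```

In this toy setup the loss falls below log 2 whenever the same-policy candidate is scored higher, which is the direction a discriminative pre-training objective of this kind would push a learned scorer.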

Videos

Pre-Trained Policy Discriminators are General Reward Models · slideslive

Taxonomy

Topics: Recommender Systems and Techniques · Machine Learning and Data Classification · Emotion and Mood Recognition