Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

Ziyi Ye; Xiangsheng Li; Qiuchi Li; Qingyao Ai; Yujia Zhou; Wei Shen; Dong Yan; Yiqun Liu

arXiv:2410.03742 · cs.CL · September 3, 2025 · 2 cites



TL;DR

This paper introduces a generative judge: an LLM trained on preference data to produce a judgment together with a natural-language rationale, improving interpretability and robustness to dataset bias over traditional scalar reward models.

Contribution

It proposes a novel method to train a generative judge with self-generated contrastive judgments, eliminating the need for a reward head and enhancing interpretability and bias robustness.

Findings

01 Performance comparable to scalar reward models on preference data

02 Superior interpretability due to natural language rationales

03 Greater robustness against dataset biases

Abstract

Learning from preference feedback is a common practice for aligning large language models (LLMs) with human values. Conventionally, preference data is learned and encoded into a scalar reward model that connects a value head with an LLM to produce a scalar score as preference or reward. However, scalar models lack interpretability and are known to be susceptible to biases in datasets. This paper investigates leveraging the generation capability of LLMs to address both limitations in one shot. Specifically, we prompt the pre-trained LLM to generate positive and negative judgments, both supported with rationales in natural language form. The self-generated contrastive judgment pairs are used to train the generative judge with Direct Preference Optimization (DPO). This proposal of training the generative Judge using self-generated Contrastive judgments (Con-J) ensures natural…
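The recipe in the abstract (sample judgments from the LLM, pair a positive one with a negative one, then train with DPO) can be illustrated with a small sketch. This is not the authors' code. It assumes that a judgment counts as "positive" when its verdict agrees with the human preference label, and it uses the standard DPO objective on precomputed sequence log-probabilities. The names `Judgment`, `make_contrastive_pair`, and `dpo_loss`, and the value `beta=0.1`, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Judgment:
    """One sampled judgment (hypothetical container, not from the paper)."""
    rationale: str      # natural-language explanation produced by the LLM
    verdict: str        # "A" or "B": which answer the judgment prefers
    logp_policy: float  # sequence log-prob under the judge being trained
    logp_ref: float     # sequence log-prob under the frozen reference model


def make_contrastive_pair(
    samples: List[Judgment], preferred: str
) -> Optional[Tuple[Judgment, Judgment]]:
    """Return (positive, negative), or None if the samples do not disagree.

    Assumption: a judgment is positive when its verdict matches the
    human preference label, and negative otherwise.
    """
    positives = [j for j in samples if j.verdict == preferred]
    negatives = [j for j in samples if j.verdict != preferred]
    if not positives or not negatives:
        return None
    return positives[0], negatives[0]


def dpo_loss(pos: Judgment, neg: Judgment, beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (pos log-ratio - neg log-ratio))."""
    margin = beta * (
        (pos.logp_policy - pos.logp_ref) - (neg.logp_policy - neg.logp_ref)
    )
    # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy usage with made-up log-probabilities.
samples = [
    Judgment("Answer A cites the source correctly.", "A", -12.0, -13.0),
    Judgment("Answer B is longer, so it is better.", "B", -15.0, -14.0),
]
pair = make_contrastive_pair(samples, preferred="A")
if pair is not None:
    loss = dpo_loss(*pair)
```

In a real training loop the log-probabilities would come from the judge and a frozen copy of it, and the loss would be averaged over a batch and backpropagated. The sketch only shows how a contrastive pair feeds the objective.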


Code & Models

Models

🤗 ZiyiYe/Con-J-Qwen2-7B
model · 212 downloads · 2 likes


Taxonomy

Topics: Bayesian Modeling and Causal Inference · Game Theory and Voting Systems · Multi-Criteria Decision Making