Policy-labeled Preference Learning: Is Preference Enough for RLHF?

Taehyun Cho; Seokhun Ju; Seungyub Han; Dohyeong Kim; Kyungjae Lee; Jungwoo Lee

arXiv:2505.06273·cs.LG·May 14, 2025

TL;DR

This paper introduces Policy-labeled Preference Learning (PPL), which models human preferences with regret rather than assuming that trajectories come from an optimal policy. Because regret reflects the behavior policy that generated each trajectory, PPL addresses the likelihood mismatch in existing RLHF methods and improves performance on continuous control tasks.

Contribution

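PPL models human preferences directly with regret, which reflects behavior policy information. Following the Direct Preference Optimization framework, it learns a policy without an explicit reward, and it adds a contrastive KL regularization derived from regret-based principles.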

Findings

01. PPL significantly improves offline RLHF performance.

02. PPL is effective in online decision-making tasks.

03. Contrastive KL regularization enhances sequential decision making.

Abstract

To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. Inspired by the Direct Preference Optimization framework, which directly learns an optimal policy without an explicit reward, we propose policy-labeled preference learning (PPL) to resolve likelihood mismatch issues by modeling human preferences with regret, which reflects behavior policy information. We also provide a contrastive KL regularization, derived from regret-based principles, to enhance RLHF in sequential decision making. Experiments in high-dimensional continuous control…
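To make the idea of "modeling human preferences with regret" concrete, here is a minimal sketch of a Bradley-Terry style preference model in which each trajectory segment is scored by its negative total regret. This is an illustration under stated assumptions, not the paper's objective: the function names, the `temperature` parameter, and the per-step regret values passed in are all hypothetical, and the sketch does not show how PPL estimates regret from the behavior policy.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_probability(regret_a, regret_b, temperature=1.0):
    """Probability that segment A is preferred over segment B.

    Assumption: each segment is scored by its negative total regret,
    so the segment with lower regret is more likely to be preferred.
    `regret_a` and `regret_b` are hypothetical per-step regret values.
    """
    diff = (sum(regret_b) - sum(regret_a)) / temperature
    # Logistic function of the score difference, written via softplus
    # so that large |diff| does not overflow.
    return math.exp(-_softplus(-diff))


def preference_loss(regret_preferred, regret_rejected, temperature=1.0):
    """Negative log-likelihood of one labeled preference pair."""
    diff = (sum(regret_rejected) - sum(regret_preferred)) / temperature
    return _softplus(-diff)


if __name__ == "__main__":
    low_regret = [0.0, 0.1, 0.0]   # hypothetical near-optimal segment
    high_regret = [0.5, 0.8, 0.3]  # hypothetical worse segment
    print(preference_probability(low_regret, high_regret))
    print(preference_loss(low_regret, high_regret))
```

In this sketch, two segments with equal total regret are preferred with equal probability, and the loss shrinks as the labeled-preferred segment's regret falls below the rejected segment's regret.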

Videos

Policy-labeled Preference Learning: Is Preference Enough for RLHF? · SlidesLive

Taxonomy

Methods: ALIGN