Contrastive Preference Learning: Learning from Human Feedback without RL

Joey Hejna; Rafael Rafailov; Harshit Sikchi; Chelsea Finn; Scott Niekum; W. Bradley Knox; Dorsa Sadigh

arXiv:2310.13639 · cs.LG · May 1, 2024 · 2 cites

TL;DR

This paper introduces Contrastive Preference Learning (CPL), a novel method for learning from human preferences without reinforcement learning, addressing limitations of existing RLHF approaches in high-dimensional and sequential tasks.

Contribution

The paper proposes CPL, a simple, off-policy algorithm that learns optimal policies directly from preferences using a contrastive objective, avoiding reward modeling and RL.
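
As a rough illustration of what a contrastive objective over preferences can look like, the sketch below scores each behavior segment by a discounted sum of the policy's own log-probabilities and applies a Bradley-Terry (logistic) comparison between the preferred and rejected segment. This page does not state the exact objective, so the scoring form, the `alpha` and `gamma` parameters, and both function names are assumptions made for illustration; this is not the code from the official repository.

```python
import math

def discounted_log_prob_sum(log_probs, alpha=0.1, gamma=1.0):
    """Sum of gamma^t * alpha * log pi(a_t | s_t) over one segment.
    alpha (temperature) and gamma (discount) are assumed hyperparameters."""
    return sum((gamma ** t) * alpha * lp for t, lp in enumerate(log_probs))

def contrastive_preference_loss(logp_preferred, logp_rejected, alpha=0.1, gamma=1.0):
    """Negative log-likelihood that the preferred segment wins a
    Bradley-Terry comparison scored by the policy's log-probabilities."""
    s_pos = discounted_log_prob_sum(logp_preferred, alpha, gamma)
    s_neg = discounted_log_prob_sum(logp_rejected, alpha, gamma)
    diff = s_pos - s_neg
    # -log(sigmoid(diff)), written in a numerically stable form
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical per-step log-probabilities under the current policy.
preferred = [-0.2, -0.1, -0.3]
rejected = [-1.5, -2.0, -1.0]
loss = contrastive_preference_loss(preferred, rejected)
```

The loss involves only policy log-probabilities on logged segments, which is why a sketch like this needs neither a separate reward model nor an RL step. It shrinks as the policy assigns more probability to the preferred segment's actions than to the rejected segment's.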

Findings

01. CPL scales to high-dimensional, sequential RLHF problems.
02. CPL outperforms reward-based methods in various settings.
03. CPL is simpler and more scalable than prior RLHF algorithms.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically RLHF algorithms operate in two phases: first, use human preferences to learn a reward function and second, align the model by optimizing the learned reward via reinforcement learning (RL). This paradigm assumes that human preferences are distributed according to reward, but recent work suggests that they instead follow the regret under the user's optimal policy. Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase. Because of these optimization challenges, contemporary RLHF methods restrict themselves to contextual bandit settings (e.g., as in large language models) or limit observation…
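
The difference between the two preference assumptions named in the abstract can be sketched with a single Bradley-Terry comparison over per-step scores. In the reward-based view the scores are rewards; in the regret-based view they are taken here to be advantages under the user's optimal policy, which is an assumption about how regret is operationalized. The function name and all numbers are hypothetical.

```python
import math

def segment_preference_prob(scores_a, scores_b, gamma=1.0):
    """Bradley-Terry probability that segment A is preferred over segment B,
    given one score per timestep (a reward or an advantage)."""
    total_a = sum((gamma ** t) * s for t, s in enumerate(scores_a))
    total_b = sum((gamma ** t) * s for t, s in enumerate(scores_b))
    return 1.0 / (1.0 + math.exp(total_b - total_a))

# Hypothetical numbers: both segments collect the same total reward ...
rewards_a, rewards_b = [1.0, 0.0], [0.0, 1.0]
# ... but under an assumed optimal policy, B's actions were further from optimal.
advantages_a, advantages_b = [0.0, 0.0], [-1.0, -0.5]

p_reward = segment_preference_prob(rewards_a, rewards_b)
p_regret = segment_preference_prob(advantages_a, advantages_b)
```

With these made-up values the reward-based model is indifferent between the segments, while the regret-based model favors segment A, whose actions match the assumed optimal policy.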

Code & Models

Repositories

jhejna/cpl (PyTorch, official)

Models

🤗 prhegde/aligned-merge-aanaphi-phi2-orage-3b · model · 5 dl

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research

Methods: ALIGN