Contrastive Preference Learning: Learning from Human Feedback without RL
Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh

TL;DR
This paper introduces Contrastive Preference Learning (CPL), a novel method for learning from human preferences without reinforcement learning, addressing limitations of existing RLHF approaches in high-dimensional and sequential tasks.
Contribution
The paper proposes CPL, a simple, off-policy algorithm that learns optimal policies directly from preferences using a contrastive objective, avoiding reward modeling and RL.
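To make the idea of a contrastive objective over preferences concrete, here is a minimal sketch for a single pair of behavior segments. It assumes the loss is a logistic comparison between the discounted sums of policy log-probabilities along the preferred and the rejected segment. The function name `cpl_loss`, the temperature `alpha`, and the discount `gamma` are illustrative assumptions, not taken from this page, and the paper's exact objective may include further terms.

```python
import numpy as np

def cpl_loss(logp_pref, logp_rej, alpha=0.1, gamma=1.0):
    """Contrastive loss for one preference pair (illustrative sketch).

    logp_pref, logp_rej: per-step log pi(a_t | s_t) along the preferred
    and the rejected segment. alpha and gamma are assumed hyperparameters.
    """
    logp_pref = np.asarray(logp_pref, dtype=float)
    logp_rej = np.asarray(logp_rej, dtype=float)

    # Discounted, temperature-scaled sum of log-probabilities per segment.
    score_pref = alpha * np.sum(gamma ** np.arange(len(logp_pref)) * logp_pref)
    score_rej = alpha * np.sum(gamma ** np.arange(len(logp_rej)) * logp_rej)

    # -log sigmoid(score_pref - score_rej), written in a numerically stable form.
    return np.logaddexp(0.0, -(score_pref - score_rej))
```

The loss depends only on the policy's log-probabilities for logged actions, so in this sketch no reward model or RL rollout is needed. It shrinks as the policy assigns relatively more probability to the preferred segment's actions.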
Findings
CPL scales to high-dimensional, sequential RLHF problems.
CPL outperforms reward-based methods in various settings.
CPL is simpler and more scalable than prior RLHF algorithms.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically RLHF algorithms operate in two phases: first, use human preferences to learn a reward function and second, align the model by optimizing the learned reward via reinforcement learning (RL). This paradigm assumes that human preferences are distributed according to reward, but recent work suggests that they instead follow the regret under the user's optimal policy. Thus, learning a reward function from feedback is not only based on a flawed assumption of human preference, but also leads to unwieldy optimization challenges that stem from policy gradients or bootstrapping in the RL phase. Because of these optimization challenges, contemporary RLHF methods restrict themselves to contextual bandit settings (e.g., as in large language models) or limit observation…
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research
Methods: ALIGN
