Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Chenliang Li; Siliang Zeng; Zeyi Liao; Jiaxiang Li; Dongyeop Kang; Alfredo Garcia; Mingyi Hong

arXiv:2406.06874 · cs.AI · December 3, 2024

Open Access

TL;DR

This paper introduces Alignment with Integrated Human Feedback (AIHF), a unified approach that jointly learns reward models and policies from demonstrations and preferences, improving alignment in large language models and robotic control.

Contribution

The paper presents a single-stage method for joint reward and policy learning that outperforms traditional multi-stage approaches like RLHF and DPO, especially with limited preference data.

Findings

01. AIHF outperforms RLHF and DPO in experiments.

02. The approach effectively utilizes limited high-quality preference data.

03. It simplifies existing alignment pipelines with minor modifications.

Abstract

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into successive stages, such as supervised fine-tuning (SFT), reward modeling (RM), and reinforcement learning (RL), each performing one specific learning task. Such a sequential approach results in serious issues such as significant under-utilization of data and distribution mismatch between the learned reward model and generated policy, which eventually lead to poor alignment performance. We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF), capable of integrating both human preference and demonstration to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily…
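The abstract contrasts a staged pipeline with a single objective that draws on both demonstrations and preferences. The sketch below is only a rough illustration of that idea, not the AIHF algorithm, whose details are not given in this excerpt. It assumes a finite set of candidate responses, a linear reward model, a softmax policy induced by that reward, and a Bradley-Terry preference likelihood. The names `joint_loss`, `features`, `demo_idx`, `pref_pairs`, `beta`, and `pref_weight` are all hypothetical.

```python
import numpy as np


def log_softmax(z):
    """Numerically stable log-softmax over a 1-D array."""
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))


def joint_loss(w, features, demo_idx, pref_pairs, beta=1.0, pref_weight=1.0):
    """Hypothetical joint objective over a finite candidate set.

    w:          (d,) reward parameters (linear reward is an assumption)
    features:   (n_candidates, d) array, one row per candidate response
    demo_idx:   indices of candidates that appear as demonstrations
    pref_pairs: list of (chosen_idx, rejected_idx) tuples
    """
    rewards = features @ w
    # Policy induced by the reward: pi(y) proportional to exp(r(y) / beta).
    log_pi = log_softmax(rewards / beta)
    # Demonstration term: negative log-likelihood of demonstrated responses.
    demo_nll = -np.mean(log_pi[demo_idx])
    # Preference term: Bradley-Terry negative log-likelihood,
    # -log sigmoid(r_chosen - r_rejected) = log(1 + exp(-margin)).
    margins = np.array([rewards[c] - rewards[r] for c, r in pref_pairs])
    pref_nll = np.mean(np.logaddexp(0.0, -margins))
    # Both data sources shape the same reward parameters in one objective.
    return demo_nll + pref_weight * pref_nll


if __name__ == "__main__":
    feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
    demos = [0]            # candidate 0 was demonstrated
    prefs = [(0, 1)]       # candidate 0 preferred over candidate 1
    print(joint_loss(np.zeros(2), feats, demos, prefs))
    print(joint_loss(np.array([2.0, 0.0]), feats, demos, prefs))
```

In this toy setup a reward that favors the demonstrated and preferred candidate yields a lower loss than an uninformative reward, because both terms update the same parameters rather than being fit in separate stages.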

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Data Stream Mining Techniques

Methods: Direct Preference Optimization