Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment
Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong

TL;DR
This paper introduces a unified approach called AIHF that jointly learns reward models and policies from demonstrations and preferences, improving alignment in large language models and robotic control.
Contribution
The paper presents a single-stage method for joint reward and policy learning that outperforms traditional multi-stage approaches like RLHF and DPO, especially with limited preference data.
Findings
AIHF outperforms RLHF and DPO in the paper's experiments on language-model alignment and robotic control.
The approach effectively utilizes limited high-quality preference data.
It simplifies existing alignment pipelines with minor modifications.
Abstract
Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into successive stages, such as supervised fine-tuning (SFT), reward modeling (RM), and reinforcement learning (RL), each performing one specific learning task. Such a sequential approach results in serious issues such as significant under-utilization of data and distribution mismatch between the learned reward model and generated policy, which eventually lead to poor alignment performance. We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF), capable of integrating both human preference and demonstration to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily…
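To make the idea of using demonstrations and preferences in a single objective concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes a toy bandit setting with a linear reward, a softmax policy over a small candidate set, and a Bradley-Terry preference model. All names (`reward`, `preference_nll`, `demonstration_nll`, `joint_loss`, `pref_weight`) are hypothetical and chosen only for illustration.

```python
import math


def reward(w, x):
    """Linear reward (an assumption): dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def preference_nll(w, chosen, rejected):
    """Bradley-Terry negative log-likelihood that `chosen` beats `rejected`."""
    margin = reward(w, chosen) - reward(w, rejected)
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def demonstration_nll(w, candidates, demo_index):
    """Negative log-likelihood of the demonstrated candidate under a
    softmax policy whose logits are the rewards (toy assumption)."""
    scores = [reward(w, x) for x in candidates]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[demo_index]


def joint_loss(w, candidates, demo_index, chosen, rejected, pref_weight=1.0):
    """Single objective in which both data sources shape the same reward
    parameters, instead of being consumed in separate stages."""
    return (demonstration_nll(w, candidates, demo_index)
            + pref_weight * preference_nll(w, chosen, rejected))


if __name__ == "__main__":
    candidates = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    for w in ([0.0, 0.0], [1.0, 0.0]):
        loss = joint_loss(w, candidates, 0, candidates[0], candidates[1])
        print(w, round(loss, 4))
```

In this toy, weights that favor the demonstrated and preferred candidate give a lower joint loss than all-zero weights, which is the behavior a joint objective should have. A staged pipeline would instead fit the policy on demonstrations first and the reward on preferences separately.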
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Data Stream Mining Techniques
Methods: Direct Preference Optimization
