Dual Active Learning for Reinforcement Learning from Human Feedback
Pangpang Liu, Chengchun Shi, Will Wei Sun

TL;DR
This paper introduces a dual active learning approach for reinforcement learning from human feedback, optimizing the selection of conversations and teachers to efficiently learn reward functions for aligning large language models with human preferences.
Contribution
It proposes a novel dual active reward learning algorithm combined with pessimistic RL, with theoretical guarantees and superior empirical performance.
Findings
The reward estimator achieves minimal generalized variance asymptotically.
The sub-optimality of the policy scales as O(1/√T), where T is the sample size.
The method outperforms existing approaches in simulations and LLM experiments.
Abstract
Aligning large language models (LLMs) with human preferences is critical to recent advances in generative artificial intelligence. Reinforcement learning from human feedback (RLHF) is widely applied to achieve this objective. A key step in RLHF is to learn the reward function from human feedback. However, human feedback is costly and time-consuming, making it essential to collect high-quality conversation data for human teachers to label. Additionally, different human teachers have different levels of expertise. It is thus critical to query the most appropriate teacher for their opinions. In this paper, we use offline reinforcement learning (RL) to formulate the alignment problem. Motivated by the idea of D-optimal design, we first propose a dual active reward learning algorithm for the simultaneous selection of conversations and teachers. Next, we apply pessimistic RL to solve the…
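To make the D-optimal design idea concrete, here is a minimal sketch of greedy selection of (conversation, teacher) pairs. It is an illustration under stated assumptions and not the authors' algorithm: it assumes a linear reward with a Bradley-Terry-style preference model in which teacher j has a rationality parameter beta_j, so that a query contributes beta_j^2 p(1 - p) x x^T to the Fisher information. The function name `greedy_d_optimal`, the feature matrix `X`, the pilot estimate `theta`, and the `ridge` term are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_d_optimal(X, betas, theta, budget, ridge=1e-3):
    """Greedily pick (conversation, teacher) pairs to maximize det of the information matrix.

    X      : (n, d) feature differences, one row per candidate conversation (assumed)
    betas  : (m,) teacher rationality parameters (assumed known)
    theta  : (d,) pilot estimate of the reward parameter (assumed)
    budget : number of queries to select; pairs may be selected more than once
    """
    d = X.shape[1]
    H = ridge * np.eye(d)  # small ridge keeps H invertible at the start
    chosen = []
    for _ in range(budget):
        H_inv = np.linalg.inv(H)
        # Quadratic form x_i^T H^{-1} x_i for every candidate conversation.
        quad = np.einsum("id,de,ie->i", X, H_inv, X)
        # Information weight w_ij = beta_j^2 * p_ij * (1 - p_ij).
        p = sigmoid(np.outer(X @ theta, betas))
        w = (betas ** 2)[None, :] * p * (1.0 - p)
        # Matrix determinant lemma: det(H + w x x^T) = det(H) * (1 + w x^T H^{-1} x),
        # so the pair with the largest w_ij * quad_i gives the largest determinant.
        gain = w * quad[:, None]
        i, j = np.unravel_index(np.argmax(gain), gain.shape)
        chosen.append((int(i), int(j)))
        H += w[i, j] * np.outer(X[i], X[i])
    return chosen, H
```

Maximizing the determinant of the information matrix is equivalent to minimizing the determinant of the estimator's asymptotic covariance, which is the generalized variance mentioned in the findings. The greedy rank-one update is one common heuristic for this objective; other solvers would serve equally well for illustration.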
Taxonomy
Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications
