The Power of Active Multi-Task Learning in Reinforcement Learning from Human Feedback
Ruitao Chen, Liwei Wang

TL;DR
This paper models reinforcement learning from human feedback as a contextual dueling bandit problem, proposing a task relevance-aware sampling strategy that reduces sample complexity and enhances learning efficiency.
Contribution
It introduces a novel formulation of RLHF as a contextual dueling bandit problem with a linear representation, and develops an algorithm that adaptively allocates samples based on task relevance.
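One way to write down a contextual dueling bandit with a shared linear representation is sketched below. The notation (feature map \phi, representation matrix B, task weights w_t, logistic link \sigma) is assumed for illustration and is not taken from the paper; in particular, the Bradley-Terry style logistic link is a common choice in RLHF but may differ from the authors' exact model.

```latex
% Assumed notation, for illustration only.
% Reward of action a in context x for task t, through a shared d-to-k linear map B:
r_t(x, a) = \phi(x, a)^\top B \, w_t,
\qquad B \in \mathbb{R}^{d \times k},\quad w_t \in \mathbb{R}^{k},\quad k \ll d

% Dueling (preference) feedback: probability that a is preferred over a' in context x:
\Pr\left(a \succ a' \mid x, t\right) = \sigma\left(r_t(x, a) - r_t(x, a')\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Under this reading, the source tasks are used to learn the shared matrix B, and the target task only needs to estimate its own k-dimensional weight vector.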
Findings
Source-task sample complexity is reduced by allocating samples according to task relevance.
The proposed method achieves ε-optimality with fewer source task samples.
Target task sample complexity scales linearly with latent space dimension.
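The idea of relevance-aware allocation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's algorithm: the helper names `estimate_relevance` and `allocate_samples` are hypothetical, relevance is taken to be the least-squares coefficients that express the target task's latent weight vector in terms of the source tasks' weight vectors, and the budget is split in proportion to the squared coefficients.

```python
import numpy as np


def estimate_relevance(source_weights, target_weight):
    """Hypothetical helper: least-squares coefficients nu such that
    source_weights @ nu approximates target_weight.

    source_weights: array of shape (k, T), one latent weight vector per source task.
    target_weight:  array of shape (k,), latent weight vector of the target task.
    """
    nu, *_ = np.linalg.lstsq(source_weights, target_weight, rcond=None)
    return nu


def allocate_samples(nu, total_budget, min_per_task=1):
    """Hypothetical helper: split a sample budget across source tasks in
    proportion to nu**2 (an assumed rule), with a floor of min_per_task."""
    scores = nu ** 2
    if scores.sum() == 0:
        # No relevance signal: fall back to uniform shares.
        shares = np.full(len(nu), 1.0 / len(nu))
    else:
        shares = scores / scores.sum()
    counts = np.maximum(min_per_task, np.floor(shares * total_budget))
    return counts.astype(int)


# Toy example: three source tasks whose latent weights are the standard basis.
W = np.eye(3)
w_target = np.array([0.9, 0.1, 0.0])

nu = estimate_relevance(W, w_target)
counts = allocate_samples(nu, total_budget=1000)
print(counts)
```

In this toy setup the first source task receives almost all of the budget, while the irrelevant third task receives only the minimum, in contrast with a uniform split across the three tasks.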
Abstract
Reinforcement learning from human feedback (RLHF) has contributed to performance improvements in large language models. To reduce its reliance on substantial amounts of human-labeled data, a successful approach is multi-task representation learning, which involves learning a high-quality, low-dimensional representation from a wide range of source tasks. In this paper, we formulate RLHF as a contextual dueling bandit problem and assume a common linear representation. We demonstrate that the sample complexity of source tasks in multi-task RLHF can be reduced by considering task relevance and allocating different sample sizes to source tasks according to their relevance. We further propose an algorithm that estimates task relevance from a small amount of additional data and then learns a policy. We prove that to achieve ε-optimality, the sample complexity of the source tasks can be…
Taxonomy
Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications
