Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling

Han Qi; Haochen Yang; Qiaosheng Zhang; Zhuoran Yang

arXiv:2502.05434·cs.LG·August 11, 2025

Open Access

TL;DR

This paper introduces novel, sample-efficient algorithms for reinforcement learning from human feedback using information-directed sampling, with theoretical guarantees and practical approximations applicable to large language models.

Contribution

It develops IDS-based RLHF algorithms with theoretical regret bounds, introduces a surrogate environment and a new distance measure, and proposes a computationally efficient approximate method.

Findings

01

Achieves Bayesian regret bounds of order $O(H^{3/2}\sqrt{\log(K(\epsilon)) T})$

02

Specializes to tabular settings with regret of order $\tilde{O}(H^2\sqrt{SAT})$

03

Proposes an approximate IDS algorithm maintaining sample efficiency
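
To get a feel for how the two rates above scale, the following minimal sketch evaluates only their leading terms. The function names and the sample values of H, S, A, and log K(ε) are illustrative assumptions, not taken from the paper; constants and the logarithmic factors hidden by the tilde are dropped, so the numbers show growth in T rather than actual regret.

```python
import math


def general_rate(H, log_K, T):
    """Leading term H^{3/2} * sqrt(log K(eps) * T); the constant is dropped."""
    return H ** 1.5 * math.sqrt(log_K * T)


def tabular_rate(H, S, A, T):
    """Leading term H^2 * sqrt(S * A * T); constants and log factors are dropped."""
    return H ** 2 * math.sqrt(S * A * T)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    H, S, A, log_K = 10, 50, 5, 20.0
    for T in (10**2, 10**4, 10**6):
        per_episode = tabular_rate(H, S, A, T) / T
        print(f"T={T:>8}  tabular leading term / T = {per_episode:.3f}")
```

Both leading terms grow like the square root of T, so the per-episode average shrinks like one over the square root of T: quadrupling T doubles the bound on cumulative regret and halves its per-episode average.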

Abstract

We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term, which quantifies the information gained about the environment through observed human feedback data and thereby encourages exploration of the unknown environment. To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified surrogate environment and introduce a novel distance measure (named the $\ell_g$-distance), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order…
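
The objective described in the abstract, a value term plus a mutual information term, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's algorithm: the environment is one of a few candidate reward vectors with a known posterior, the learner picks a pair of responses to show a human, feedback follows a Bradley-Terry model, the "value" of a pair is the average posterior-mean reward of its two responses, and `lam` is a hypothetical trade-off weight.

```python
import itertools

import numpy as np


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, clipped for stability."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def preference_prob(rewards, i, j):
    """Bradley-Terry probability that response i beats j, one entry per environment."""
    return 1.0 / (1.0 + np.exp(-(rewards[:, i] - rewards[:, j])))


def mutual_information(posterior, rewards, i, j):
    """I(environment; feedback) for the query (i, j) under the current posterior."""
    p = preference_prob(rewards, i, j)
    marginal = posterior @ p  # feedback probability averaged over environments
    return binary_entropy(marginal) - posterior @ binary_entropy(p)


def ids_select_pair(posterior, rewards, lam=1.0):
    """Pick the pair maximizing (assumed) value + lam * mutual information."""
    mean_reward = posterior @ rewards
    best_pair, best_score = None, -np.inf
    for i, j in itertools.combinations(range(rewards.shape[1]), 2):
        value = 0.5 * (mean_reward[i] + mean_reward[j])
        score = value + lam * mutual_information(posterior, rewards, i, j)
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


def bayes_update(posterior, rewards, i, j, i_preferred):
    """Posterior over environments after observing one preference label."""
    p = preference_prob(rewards, i, j)
    likelihood = p if i_preferred else 1.0 - p
    unnormalized = posterior * likelihood
    return unnormalized / unnormalized.sum()


if __name__ == "__main__":
    # Two hypothetical environments that disagree about responses 0 and 1.
    rewards = np.array([[1.0, 0.0, 0.5],
                        [0.0, 1.0, 0.5]])
    posterior = np.array([0.5, 0.5])
    pair, score = ids_select_pair(posterior, rewards)
    print("query pair:", pair, "score:", round(float(score), 3))
    print("posterior after feedback:", bayes_update(posterior, rewards, *pair, True))
```

In this toy instance every pair has the same expected value, so the mutual information term decides the choice: the learner queries the two responses on which the candidate environments disagree most, since that comparison is the most informative about which environment is the true one.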

Taxonomy

Topics: Reinforcement Learning in Robotics · Distributed Sensor Networks and Detection Algorithms · Neural Networks and Applications