Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Heyang Zhao, Chenlu Ye, Quanquan Gu, Tong Zhang

TL;DR
This paper provides a sharp theoretical analysis demonstrating that KL-regularization significantly improves sample complexity in contextual bandits and RLHF, reducing it from O(1/ε^2) to O(1/ε) under certain conditions.
Contribution
It is the first to theoretically establish the power of KL-regularization with a sharp analysis, and explores the impact of data coverage on RLHF sample complexity.
Findings
KL-regularization reduces sample complexity to O(1/ε) for small ε.
A simple two-stage sampling strategy achieves near-optimal sample complexity with sufficient coverage.
Theoretical insights clarify the roles of KL-regularization and data coverage in RLHF.
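As a rough illustration of the objective these findings concern, the sketch below treats a single context with a finite action set and maximizes expected reward minus a scaled reverse-KL penalty to a reference policy. The names (`rewards`, `ref_policy`, `eta`) and the convention that the penalty is weighted by 1/eta are assumptions made for this example, not notation confirmed by the summary above. Under that convention the maximizer is the standard Gibbs (softmax-tilted) policy, proportional to ref_policy * exp(eta * reward), and the optimal value equals (1/eta) * log of the normalizing constant.

```python
import numpy as np

def kl_regularized_policy(rewards, ref_policy, eta):
    """Closed-form maximizer of E_pi[r] - (1/eta) * KL(pi || ref_policy).

    Assumes a finite action set and a strictly positive reference policy.
    """
    logits = np.log(ref_policy) + eta * rewards
    logits -= logits.max()              # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def kl_regularized_objective(policy, rewards, ref_policy, eta):
    """Expected reward minus the (1/eta)-weighted reverse KL to ref_policy."""
    kl = np.sum(policy * (np.log(policy) - np.log(ref_policy)))
    return float(policy @ rewards - kl / eta)

# Hypothetical toy instance: three actions, one context.
rewards = np.array([1.0, 0.5, 0.0])
ref_policy = np.array([0.2, 0.3, 0.5])
eta = 2.0

pi_star = kl_regularized_policy(rewards, ref_policy, eta)
best_value = kl_regularized_objective(pi_star, rewards, ref_policy, eta)

# The optimal value has the closed form (1/eta) * log sum_a ref(a) * exp(eta * r(a)).
closed_form = float(np.log(np.sum(ref_policy * np.exp(eta * rewards))) / eta)
```

Larger `eta` weakens the penalty and moves `pi_star` toward the greedy action; smaller `eta` keeps it closer to `ref_policy`. The penalty makes the objective strongly concave in the policy, which is commonly cited as the structural reason regularized problems can admit faster rates; the summary above does not spell out the paper's argument.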
Abstract
Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique for enhancing policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF); it forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization have been empirically demonstrated in various practical scenarios, current theoretical analysis of KL-regularized RLHF still obtains the same sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with KL-regularization and ones without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an …
Taxonomy
Topics: Advanced Bandit Algorithms Research · Distributed Sensor Networks and Detection Algorithms
