Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

arXiv:2411.04625·cs.LG·February 12, 2025

Heyang Zhao, Chenlu Ye, Quanquan Gu, Tong Zhang

TL;DR

This paper provides a sharp theoretical analysis demonstrating that KL-regularization significantly improves sample complexity in contextual bandits and RLHF, reducing it from $O(1/\epsilon^{2})$ to $O(1/\epsilon)$ under certain conditions.

Contribution

It is the first to theoretically establish the power of KL-regularization with a sharp analysis, and explores the impact of data coverage on RLHF sample complexity.

Findings

01

KL-regularization reduces sample complexity to $O(1/\epsilon)$ for small $\epsilon$.

02

A simple two-stage sampling strategy achieves near-optimal sample complexity with sufficient coverage.

03

Theoretical insights clarify the roles of KL-regularization and data coverage in RLHF.

Abstract

Reverse-Kullback-Leibler (KL) regularization has emerged to be a predominant technique used to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization have been empirically demonstrated in various practical scenarios, current theoretical analysis of KL-regularized RLHF still obtains the same $O(1/\epsilon^{2})$ sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with KL-regularization and ones without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an $O(1/\epsilon)$ …
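
To make "stay close to a reference policy" concrete, here is a minimal sketch of a KL-regularized objective for a single context with a finite action set. It is not taken from the paper: the function names, the toy rewards, the reference probabilities, and the regularization weight `beta` are all assumptions for illustration, and the paper may parametrize the regularizer differently. The sketch relies on the standard fact that the objective "expected reward minus `beta` times KL(policy || reference)" is maximized by reweighting the reference policy by `exp(reward / beta)`.

```python
import math


def kl_regularized_policy(rewards, ref_probs, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) for one context.

    Hypothetical helper for illustration. Assumes every entry of
    ref_probs is strictly positive and beta > 0.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(q) + r / beta for r, q in zip(rewards, ref_probs)]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl_regularized_objective(policy, rewards, ref_probs, beta):
    """Expected reward minus beta times the reverse KL to the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl


# Toy numbers (assumed, not from the paper).
rewards = [1.0, 0.5, 0.0]
ref_probs = [0.2, 0.3, 0.5]
beta = 0.5

policy = kl_regularized_policy(rewards, ref_probs, beta)
best = kl_regularized_objective(policy, rewards, ref_probs, beta)
at_reference = kl_regularized_objective(ref_probs, rewards, ref_probs, beta)
near_greedy = kl_regularized_objective([0.98, 0.01, 0.01], rewards, ref_probs, beta)
```

In this toy setting, a larger `beta` keeps the resulting policy closer to `ref_probs`, while a smaller `beta` moves it toward the highest-reward action; both the reference policy and a near-greedy policy score no higher than `best` under the regularized objective.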

Taxonomy

Topics: Advanced Bandit Algorithms Research · Distributed Sensor Networks and Detection Algorithms