Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
Qingyue Zhao, Kaixuan Ji, Heyang Zhao, Quanquan Gu

TL;DR
This paper establishes the first $\tilde{O}(\frac{1}{\epsilon})$ sample complexity bounds for offline contextual bandits with forward-KL regularization, improving upon previous $\tilde{O}(\frac{1}{\epsilon^2})$ rates.
Contribution
It introduces a novel convex-analytical approach for analyzing forward-KL regularized offline contextual bandits under single-policy concentrability, achieving tight bounds.
Findings
First $\tilde{O}(\frac{1}{\epsilon})$ upper bounds for forward-KL regularized offline CBs.
Unified analysis framework that bypasses previous proof routines based on the mean value theorem.
Rate-optimal lower bounds demonstrating the tightness of the upper bounds.
Abstract
\emph{Kullback-Leibler} (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of \emph{reverse} or \emph{forward} KL. Recent studies have demonstrated $\tilde{O}(\epsilon^{-1})$-type fast rates for decision making under reverse KL regularization, in contrast to the standard $\tilde{O}(\epsilon^{-2})$-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline contextual bandits (CBs). We give the first $\tilde{O}(\epsilon^{-1})$ upper bounds in tabular and general function approximation settings, both under notions of \emph{single-policy concentrability}. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a…
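To make the reverse/forward distinction concrete, here is a minimal sketch for a single context with three actions. It assumes the common convention that reverse KL is $\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ and forward KL is $\mathrm{KL}(\pi_{\mathrm{ref}} \,\|\, \pi)$; the policies, rewards, and regularization weight `beta` are made-up illustrative values, not taken from the paper, and the paper's exact objective may differ in details.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.

    Terms with p_i == 0 contribute zero by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical values for illustration only.
pi_ref = [0.5, 0.3, 0.2]   # reference policy over 3 actions
pi = [0.8, 0.1, 0.1]       # candidate policy being evaluated
rewards = [1.0, 0.2, 0.0]  # mean reward of each action
beta = 0.5                 # regularization weight (assumed)

expected_reward = sum(p * r for p, r in zip(pi, rewards))

# Assumed convention: reverse KL puts the learned policy first,
# forward KL puts the reference policy first.
reverse_kl = kl(pi, pi_ref)
forward_kl = kl(pi_ref, pi)

reverse_objective = expected_reward - beta * reverse_kl
forward_objective = expected_reward - beta * forward_kl

print(f"reverse KL = {reverse_kl:.4f}, objective = {reverse_objective:.4f}")
print(f"forward KL = {forward_kl:.4f}, objective = {forward_objective:.4f}")
```

Because KL divergence is asymmetric, the two penalties generally differ for the same pair of policies, so the two regularized objectives rank candidate policies differently.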
