On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
Kaixuan Ji, Qiwei Di, Heyang Zhao, Qingyue Zhao, Quanquan Gu

TL;DR
This paper characterizes the sample complexity of offline multi-armed bandits with KL regularization, providing matching upper and lower bounds across different regularization regimes.
Contribution
It offers a nearly complete analysis of the sample complexity for KL-regularized offline MABs, including sharp bounds and insights into regularization effects.
Findings
Achieves a sample complexity of () under large regularization.
Achieves a sample complexity of () under small regularization.
Provides matching lower bounds over all regularization strengths.
Abstract
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains far from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of under large regularization , and a sample complexity of under small regularization , where is the regularization parameter, is the number of contexts, is the number of arms, policy…
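For readers unfamiliar with KL-regularized performance metrics, the following is a minimal sketch of the standard KL-regularized objective for a single-context bandit. It is not the paper's KL-PCB algorithm and involves no offline data. The names `rewards`, `pi_ref`, and `beta`, and the convention that `beta` multiplies the KL term, are assumptions made for illustration; the paper's own parameterization of the regularization strength may differ. The sketch checks the well-known fact that the maximizer of the regularized objective is a Gibbs (softmax) reweighting of the reference policy.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def regularized_value(pi, rewards, pi_ref, beta):
    """Expected reward minus beta * KL(pi || pi_ref) (assumed convention)."""
    return float(pi @ rewards) - beta * kl(pi, pi_ref)

def optimal_policy(rewards, pi_ref, beta):
    """Closed-form maximizer: pi*(a) proportional to pi_ref(a) * exp(r(a) / beta).

    Assumes pi_ref has full support so that log(pi_ref) is finite.
    """
    logits = np.log(pi_ref) + rewards / beta
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical three-armed instance with known mean rewards.
rewards = np.array([1.0, 0.5, 0.0])
pi_ref = np.array([0.2, 0.3, 0.5])
beta = 0.5

pi_star = optimal_policy(rewards, pi_ref, beta)
best_value = regularized_value(pi_star, rewards, pi_ref, beta)

# The optimal regularized value has the log-partition closed form.
closed_form = beta * np.log(np.sum(pi_ref * np.exp(rewards / beta)))
```

Under this convention, a larger `beta` keeps the optimal policy closer to `pi_ref`, while a smaller `beta` concentrates it on the highest-reward arm. This is the kind of trade-off behind the distinction between large and small regularization regimes.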
