Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits

Qingyue Zhao; Kaixuan Ji; Heyang Zhao; Tong Zhang; Quanquan Gu

arXiv:2502.06051·cs.LG·February 27, 2026


Open Access · 3 Reviews

TL;DR

This paper provides a detailed analysis of offline policy learning in contextual bandits with $f$-divergence regularization, establishing tight sample complexity bounds under various conditions and highlighting the importance of concentrability assumptions.

Contribution

It introduces a novel analysis for reverse KL divergence achieving optimal sample complexity under single-policy concentrability, and extends results to strongly convex $f$-divergences without pessimism.

Findings

01

Achieves $\tilde{O}(\epsilon^{-1})$ sample complexity for reverse KL under single-policy concentrability.

02

Gives a near-matching lower bound showing that the dependence on single-policy concentrability is necessary.

03

Extends analysis to strongly convex $f$-divergences, achieving sharp sample complexity without pessimism.
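
For reference, the objects named in these findings can be written out as follows. The notation ($\eta$ for the regularization strength, $\pi_{\mathrm{ref}}$ for the reference policy, $\rho$ for the context distribution) is assumed here for illustration and may differ from the paper's.

$$
D_f(P \,\|\, Q) = \mathbb{E}_{a \sim Q}\left[ f\!\left( \frac{P(a)}{Q(a)} \right) \right], \qquad f \text{ convex},\ f(1) = 0,
$$

$$
J(\pi) = \mathbb{E}_{x \sim \rho}\left[ \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[ r(x, a) \big] - \eta^{-1} D_f\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \right].
$$

Reverse KL corresponds to $f(t) = t \log t$, whose second derivative $f''(t) = 1/t$ is not bounded away from zero on $(0, \infty)$, so this $f$ is not strongly convex. The $\chi^2$ divergence corresponds to $f(t) = (t - 1)^2$, for which $f''(t) = 2$, so it is strongly convex. Under this notation, an $\epsilon$-optimal policy $\hat{\pi}$ with respect to the regularized objective would satisfy $\max_{\pi} J(\pi) - J(\hat{\pi}) \le \epsilon$.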

Abstract

Many offline reinforcement learning algorithms are underpinned by $f$-divergence regularization, but their sample complexity *defined with respect to regularized objectives* still lacks tight analyses, especially in terms of concrete data coverage conditions. In this paper, we study the exact concentrability requirements to achieve the $\tilde{\Theta}(\epsilon^{-1})$ sample complexity for offline $f$-divergence-regularized contextual bandits. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we achieve an $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis, surpassing the existing $\tilde{O}(\epsilon^{-1})$ bound under all-policy concentrability and the existing $\tilde{O}(\epsilon^{-2})$ bound under single-policy concentrability. We also propose a near-matching lower bound,…
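
To make the reverse-KL setting concrete, here is a minimal sketch for a single context with three actions. It relies on the standard fact that the maximizer of $\langle \pi, r \rangle - \eta^{-1}\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ is the Gibbs policy $\pi(a) \propto \pi_{\mathrm{ref}}(a)\exp(\eta\, r(a))$. Everything else is an assumption made for illustration: the names `gibbs_policy`, `pessimistic_reward`, and `kl_regularized_value`, the toy numbers, and in particular the count-based bonus `beta / sqrt(n)`, which is a stand-in for a pessimism term and not the paper's construction.

```python
import math
import random

def kl_regularized_value(pi, pi_ref, r, eta):
    """<pi, r> - (1/eta) * KL(pi || pi_ref) for one context."""
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    reward = sum(p * x for p, x in zip(pi, r))
    return reward - kl / eta

def gibbs_policy(pi_ref, r, eta):
    """Closed-form maximizer: pi(a) proportional to pi_ref(a) * exp(eta * r(a))."""
    logits = [math.log(q) + eta * x for q, x in zip(pi_ref, r)]
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [x / z for x in w]

def pessimistic_reward(r_hat, counts, beta=1.0):
    """Hypothetical pessimism: subtract a count-based bonus (an assumption)."""
    return [x - beta / math.sqrt(max(n, 1)) for x, n in zip(r_hat, counts)]

def random_policy(rng, k):
    """Draw a random point on the probability simplex."""
    w = [rng.expovariate(1.0) for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

# Toy inputs (all hypothetical).
pi_ref = [0.5, 0.3, 0.2]   # reference policy
r_hat = [0.2, 0.9, 0.5]    # estimated rewards from offline data
counts = [100, 4, 25]      # how often each action appears in the data
eta = 2.0                  # regularization strength (inverse KL weight)

r_pess = pessimistic_reward(r_hat, counts)
pi_hat = gibbs_policy(pi_ref, r_pess, eta)
best = kl_regularized_value(pi_hat, pi_ref, r_pess, eta)

# Sanity check: no randomly drawn policy should beat the Gibbs policy.
rng = random.Random(0)
random_values = [
    kl_regularized_value(random_policy(rng, 3), pi_ref, r_pess, eta)
    for _ in range(200)
]
print("Gibbs policy:", [round(p, 3) for p in pi_hat])
print("beats all random policies:", best >= max(random_values))
```

In this sketch the rarely observed second action has the highest estimated reward, but the bonus shrinks it before the Gibbs step, so the learned policy leans on it less than a non-pessimistic plug-in would.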

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Relaxation from the all-policy coverage assumption to an optimal-policy coverage assumption, with clever use of a pessimistic bonus term.
2. The lower bound of Theorem 2.11 has a multiplicative coverage term, showing that some coverage assumption is needed for any efficient algorithm.
3. The work shows that for $f$-divergence-regularized objectives, if $f$ is strongly convex, then no coverage assumption on the reference policy is necessary.

Weaknesses

1. The worst-case gap between $C^{\pi^*}$ and $D^2_{\pi^*}$ scales as $|S||A|$. This linear dependence on the size of the state space can render Algorithm 1 inefficient, even when $C^{\pi^*}$ is a constant. A more detailed discussion or empirical illustration of how these two coverage measures relate in practice would strengthen the paper.
2. Theorem 3.4 constructs only a specific instance of an $\alpha$-strongly convex $f$ (a scaled $\chi^2$ divergence) that matches the upper bound, rather t

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. The paper is well-written. The story is clear and consistent.
2. The contributions should be solid.
   a) The $f$-divergence could provide another way to understand the requirement for offline learning.
   b) The refined mean-value-type risk upper bound has some technical improvements.

Weaknesses

1. The results in the $f$-divergence algorithm are not compared with the global optimal policy, which is not the same as the KL-divergence. In other words, it seems to replace the coverage assumption with another assumption that the global optimal policy is the optimal policy they defined in 3.1. It is still an important contribution, but it may be overstated.
2. The numerical experiments are very simple, more like a sanity check than a validation.

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. This paper established a near-optimal sample complexity for offline regularized bandits.
2. The moment-based argument and integration of pessimism with curvature properties seem novel in offline bandits.
3. The extension to strongly convex $f$-divergences provides a unified theoretical view.

Weaknesses

1. Experiments are limited. Only toy two-armed-bandit cases are shown; including a real dataset would strengthen the arguments.
2. The paper seems to put more effort into the analysis of KL-divergence-regularized bandits, while the title suggests general $f$-divergences. Including a discussion of how to handle general $f$-divergence-regularized bandits would make the main body match the title.


Taxonomy

Topics: Advanced Bandit Algorithms Research · Distributed Sensor Networks and Detection Algorithms · Cognitive Radio Networks and Spectrum Sensing