Optimal Regret for Policy Optimization in Contextual Bandits

Orin Levy; Yishay Mansour

arXiv:2602.13700 · cs.LG · February 17, 2026

TL;DR

This paper establishes the first high-probability optimal regret bound for policy optimization in stochastic contextual bandits with function approximation, bridging theory and practice.

Contribution

It provides the first rigorous optimal regret analysis for policy optimization methods in contextual bandits with general offline function approximation.

Findings

01

Achieves an optimal regret bound of $\widetilde{O}(\sqrt{K |A| \log |F|})$

02

Algorithm is both efficient and theoretically optimal

03

Empirical evaluation supports theoretical claims

Abstract

We present the first high-probability optimal regret bound for a policy optimization technique applied to the problem of stochastic contextual multi-armed bandit (CMAB) with general offline function approximation. Our algorithm is both efficient and achieves an optimal regret bound of $\widetilde{O}(\sqrt{K |A| \log |F|})$, where $K$ is the number of rounds, $A$ is the set of arms, and $F$ is the function class used to approximate the losses. Our results bridge the gap between theory and practice, demonstrating that the widely used policy optimization methods for the contextual bandit problem can achieve a rigorously proved optimal regret bound. We support our theoretical results with an empirical evaluation of our algorithm.
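To give a feel for how a bound of this type scales, the sketch below evaluates the leading term $\sqrt{K |A| \log |F|}$ for a few horizons. It assumes the square-root form that is standard for optimal regret in this setting and ignores constants and lower-order logarithmic factors. The function name `regret_bound_scale` and the example values (10 arms, a class of 1,000 functions) are hypothetical and not taken from the paper.

```python
import math


def regret_bound_scale(num_rounds: int, num_arms: int, class_size: int) -> float:
    """Leading term sqrt(K * |A| * log|F|), with constants and log factors dropped.

    This is an illustrative scaling only, not the paper's exact bound.
    """
    return math.sqrt(num_rounds * num_arms * math.log(class_size))


# Hypothetical problem sizes chosen purely for illustration.
for k in (1_000, 10_000, 100_000):
    total = regret_bound_scale(k, num_arms=10, class_size=1_000)
    # Average regret per round shrinks as the horizon K grows.
    print(f"K={k:>7}  bound~{total:10.1f}  per-round~{total / k:.4f}")
```

Because the total grows like $\sqrt{K}$, the average regret per round decays like $1/\sqrt{K}$, and the dependence on the function class is only logarithmic in its size.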

Taxonomy

Topics: Advanced Bandit Algorithms Research · Risk and Portfolio Optimization · Stochastic Gradient Optimization Techniques