Optimal Regret for Policy Optimization in Contextual Bandits
Orin Levy, Yishay Mansour

TL;DR
This paper establishes the first high-probability optimal regret bound for policy optimization in stochastic contextual bandits with general offline function approximation, showing that widely used policy optimization methods can be provably optimal.
Contribution
It provides the first rigorous optimal regret analysis for policy optimization methods in contextual bandits with general offline function approximation.
Findings
Achieves an optimal regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$
Algorithm is both efficient and theoretically optimal
Empirical evaluation supports theoretical claims
Abstract
We present the first high-probability optimal regret bound for a policy optimization technique applied to the problem of stochastic contextual multi-armed bandit (CMAB) with general offline function approximation. Our algorithm is both efficient and achieves an optimal regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$, where $K$ is the number of rounds, $\mathcal{A}$ is the set of arms, and $\mathcal{F}$ is the function class used to approximate the losses. Our results bridge the gap between theory and practice, demonstrating that the widely used policy optimization methods for the contextual bandit problem can achieve a rigorously proved optimal regret bound. We support our theoretical results with an empirical evaluation of our algorithm.
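To get a feel for how a bound of this kind scales, the following minimal sketch evaluates only the leading term sqrt(K * |A| * log|F|), assuming the bound has the square-root form in the number of rounds K, the number of arms |A|, and the log-size of the function class F described in the abstract. It ignores the constants and logarithmic factors hidden by the O-tilde notation, so the numbers are illustrative rather than actual regret values. The function name and the example parameter values are hypothetical and not taken from the paper.

```python
import math


def regret_bound_leading_term(num_rounds: int, num_arms: int, class_size: int) -> float:
    """Leading term sqrt(K * |A| * log|F|) of the regret bound.

    Assumption: constants and polylogarithmic factors hidden by the
    O-tilde notation are dropped, so this shows scaling only.
    """
    if num_rounds < 1 or num_arms < 1 or class_size < 2:
        raise ValueError("need K >= 1, |A| >= 1 and |F| >= 2")
    return math.sqrt(num_rounds * num_arms * math.log(class_size))


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    for k in (10_000, 40_000, 160_000):
        bound = regret_bound_leading_term(k, num_arms=10, class_size=1_000)
        # Total regret grows like sqrt(K); average regret per round shrinks.
        print(f"K={k:>7}  leading term={bound:10.1f}  per round={bound / k:.4f}")
```

Quadrupling the number of rounds doubles the leading term, so the average regret per round falls as 1/sqrt(K); this sublinear growth is what makes the bound meaningful.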
Taxonomy
Topics: Advanced Bandit Algorithms Research · Risk and Portfolio Optimization · Stochastic Gradient Optimization Techniques
