Fighting Bandits with a New Kind of Smoothness
Jacob Abernethy, Chansoo Lee, Ambuj Tewari

TL;DR
This paper introduces a new family of algorithms for the adversarial multi-armed bandit problem, analyzes them through convex smoothing, and proves optimal or near-optimal regret bounds for both regularization and perturbation methods.
Contribution
It presents a simple analysis technique based on convex smoothing and shows that Tsallis entropy regularization achieves the minimax regret, while perturbation methods whose distribution has a bounded hazard rate achieve near-optimal regret.
Findings
Tsallis entropy regularization achieves $\tilde{O}(\sqrt{TN})$ regret.
Perturbation methods with a bounded hazard rate attain $O(\sqrt{TN \log N})$ regret.
Various distributions like Gumbel and Weibull satisfy the bounded hazard rate condition.
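For intuition on the bounded hazard rate condition: the hazard rate of a distribution with density $f$ and CDF $F$ is $h(x) = f(x) / (1 - F(x))$. The following is a minimal sketch, assuming the standard Gumbel distribution (location 0, scale 1). The function name `gumbel_hazard` and the evaluation grid are illustrative choices and are not taken from the paper.

```python
import math

def gumbel_hazard(x: float) -> float:
    """Hazard rate f(x) / (1 - F(x)) of the standard Gumbel distribution."""
    u = math.exp(-x)                # F(x) = exp(-u), f(x) = u * exp(-u)
    survival = -math.expm1(-u)      # 1 - exp(-u), computed stably for tiny u
    return u * math.exp(-u) / survival

# Evaluate on a grid from -5 to 30 (an arbitrary illustrative range).
# Analytically the hazard rate increases towards 1 as x grows.
grid = [-5 + 0.5 * k for k in range(71)]
hazards = [gumbel_hazard(x) for x in grid]
print(max(hazards))
```

Writing $u = e^{-x}$, the hazard rate simplifies to $u / (e^{u} - 1)$, which is at most 1 for every $u > 0$, so the standard Gumbel hazard rate is bounded by 1.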
Abstract
We define a novel family of algorithms for the adversarial multi-armed bandit problem, and provide a simple analysis technique based on convex smoothing. We prove two main results. First, we show that regularization via the Tsallis entropy, which includes EXP3 as a special case, achieves the minimax regret. Second, we show that a wide class of perturbation methods achieve a near-optimal regret as low as $O(\sqrt{TN \log N})$ if the perturbation distribution has a bounded hazard rate. For example, the Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this key property.
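The abstract names EXP3 as a special case of Tsallis entropy regularization. As a point of reference, here is a minimal sketch of textbook EXP3 with importance-weighted loss estimates. It is not the paper's general algorithm; the names `exp3` and `loss_fn`, the toy loss sequence, and the learning-rate choice are assumptions made for illustration.

```python
import math
import random

def exp3(loss_fn, n_arms, horizon, eta, rng):
    """Textbook EXP3: exponential weights over importance-weighted loss estimates."""
    cum_est = [0.0] * n_arms                 # estimated cumulative loss per arm
    probs = [1.0 / n_arms] * n_arms
    total_loss = 0.0
    for t in range(horizon):
        # Exponential weights; subtracting the minimum keeps exp() well scaled.
        m = min(cum_est)
        weights = [math.exp(-eta * (c - m)) for c in cum_est]
        z = sum(weights)
        probs = [w / z for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        loss = loss_fn(t, arm)               # bandit feedback: only this arm's loss is seen
        total_loss += loss
        cum_est[arm] += loss / probs[arm]    # unbiased estimate of the loss vector
    return total_loss, probs

# Toy loss sequence (hypothetical): arm 0 always has loss 0, every other arm loss 1.
n_arms, horizon = 5, 2000
eta = math.sqrt(2 * math.log(n_arms) / (horizon * n_arms))
total, probs = exp3(lambda t, arm: 0.0 if arm == 0 else 1.0,
                    n_arms, horizon, eta, random.Random(0))
print(total, probs)
```

With this learning rate, the standard EXP3 analysis gives expected regret of order $\sqrt{TN \log N}$, so on the toy sequence the total loss should stay far below the roughly $T(N-1)/N$ loss of uniform play.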
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Gaussian Processes and Bayesian Inference
