A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits
David Simchi-Levi, Zeyu Zheng, Feng Zhu

TL;DR
This paper introduces a new policy for stochastic multi-armed bandits that achieves worst-case optimal expected regret while keeping the regret distribution light-tailed. It does so by exploring more early in the time horizon and exploiting more near its end than standard confidence-bound-based policies do.
Contribution
The paper proposes a policy that is worst-case optimal for expected regret and attains the best achievable tail-probability rate among worst-case optimal policies, with extensions to the unknown-horizon ("any-time") setting and to linear bandits.
Findings
Achieves an $O(\sqrt{KT\ln T})$ worst-case expected regret bound.
Provides exponentially decaying tail bounds on the regret distribution.
Outperforms existing policies in tail risk and hyper-parameter tuning.
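The two quantities in the findings above, expected regret and the tail probability of a large regret, can be estimated by simulation. The sketch below is illustrative only: it uses textbook UCB1 on Bernoulli arms as a stand-in policy (it is not the policy proposed in the paper), and the function names `run_ucb1` and `regret_summary`, the arm means, and the threshold $x = \sqrt{KT}$ are assumptions chosen for the example.

```python
import math
import random


def run_ucb1(means, horizon, rng):
    """Play standard UCB1 once on Bernoulli arms; return the pseudo-regret.

    UCB1 is a textbook baseline used here only as a stand-in policy.
    It is NOT the policy proposed in the paper.
    """
    k = len(means)
    best = max(means)
    pulls = [0] * k
    sums = [0.0] * k
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            # pick the arm with the largest upper confidence bound
            arm = max(
                range(k),
                key=lambda a: sums[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # pseudo-regret: gap of the chosen arm
    return regret


def regret_summary(means, horizon, runs, threshold, seed=0):
    """Monte Carlo estimates of E[regret] and P(regret > threshold)."""
    rng = random.Random(seed)
    regrets = [run_ucb1(means, horizon, rng) for _ in range(runs)]
    mean_regret = sum(regrets) / runs
    tail_prob = sum(r > threshold for r in regrets) / runs
    return mean_regret, tail_prob


if __name__ == "__main__":
    K, T = 2, 2000
    x = math.sqrt(K * T)  # threshold on the sqrt(KT) scale (an assumption)
    mean_regret, tail_prob = regret_summary([0.5, 0.6], T, runs=200, threshold=x)
    print(f"mean regret ~ {mean_regret:.1f}, P(regret > {x:.1f}) ~ {tail_prob:.3f}")
```

Sweeping `threshold` over a grid gives an empirical tail curve. This is one way a policy's tail risk could be compared against a baseline, though a few hundred runs resolve only fairly large tail probabilities.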
Abstract
We study the stochastic multi-armed bandit problem and design new policies that enjoy both worst-case optimality for expected regret and light-tailed risk for regret distribution. Specifically, our policy design (i) enjoys the worst-case optimality for the expected regret at order $O(\sqrt{KT\ln T})$ and (ii) has the worst-case tail probability of incurring a regret larger than any $x>0$ being upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$, a rate that we prove to be best achievable with respect to $T$ for all worst-case optimal policies. Our proposed policy achieves a delicate balance between doing more exploration at the beginning of the time horizon and doing more exploitation when approaching the end, compared to standard confidence-bound-based policies. We also enhance the policy design to accommodate the "any-time" setting where $T$ is unknown a priori, and prove equivalently…
Taxonomy
Topics: Advanced Bandit Algorithms Research · Risk and Portfolio Optimization · Age of Information Optimization
