Minimax Policy for Heavy-tailed Bandits
Lai Wei, Vaibhav Srivastava

TL;DR
This paper introduces Robust MOSS, a new algorithm for heavy-tailed bandits that achieves optimal worst-case regret and adapts to distributions with finite moments of order 1+epsilon.
Contribution
It extends the minimax policy MOSS to heavy-tailed rewards using saturated empirical means, achieving optimal worst-case regret.
Findings
Robust MOSS matches the lower bound for worst-case regret.
The algorithm maintains distribution-dependent logarithmic regret.
It effectively handles rewards with finite moments of order 1+epsilon.
Abstract
We study the stochastic Multi-Armed Bandit (MAB) problem under worst-case regret and heavy-tailed reward distributions. We modify MOSS, a minimax policy for sub-Gaussian reward distributions, by using a saturated empirical mean to design a new algorithm called Robust MOSS. We show that if the moment of order 1+epsilon of the reward distribution exists, then the refined strategy has a worst-case regret matching the lower bound while maintaining a distribution-dependent logarithmic regret.
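To illustrate the idea of a saturated empirical mean, here is a minimal sketch in Python. It clips each observed reward to a symmetric interval before averaging, so that a single extreme draw from a heavy-tailed distribution cannot dominate the estimate. The function name `saturated_mean` and the fixed `limit` parameter are assumptions made for illustration; this summary does not state how Robust MOSS chooses its saturation level, so this should not be read as the paper's exact estimator.

```python
def saturated_mean(rewards, limit):
    """Average the rewards after clipping each one to [-limit, limit].

    `limit` is a hypothetical, fixed saturation level chosen for
    illustration; the paper's actual choice is not given in this summary.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Saturate: values beyond +/- limit are replaced by the boundary value.
    clipped = [max(-limit, min(limit, r)) for r in rewards]
    if not clipped:
        raise ValueError("need at least one reward")
    return sum(clipped) / len(clipped)


if __name__ == "__main__":
    samples = [1.0, 2.0, 100.0]  # one extreme outlier
    print("plain mean:    ", sum(samples) / len(samples))
    print("saturated mean:", saturated_mean(samples, limit=5.0))
```

In this toy example the outlier 100.0 is replaced by 5.0 before averaging, so the saturated estimate stays close to the bulk of the samples, whereas the plain mean is pulled far toward the outlier.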
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
