Efficient and Optimal Policy Gradient Algorithm for Corrupted Multi-armed Bandits
Jiayuan Liu, Siwei Wang, Zhixuan Fang

TL;DR
This paper applies SAMBA, a policy gradient algorithm, to multi-armed bandits with adversarial corruptions. It proves an asymptotically optimal regret bound and shows in simulations that SAMBA outperforms existing methods.
Contribution
The paper shows that SAMBA, a computationally efficient policy gradient algorithm, improves the regret bound for corrupted bandit problems, removing one log T factor compared to prior efficient algorithms.
Findings
SAMBA achieves a regret bound of O(K log T / Δ + C / Δ).
SAMBA removes one log T factor from the regret bound compared to CBARBAR, while keeping the corruption-dependent term linear in C.
Simulations show SAMBA outperforms existing algorithms in practice.
Abstract
In this paper, we consider the stochastic multi-armed bandit problem with adversarial corruptions, where the random rewards of the arms are partially modified by an adversary to fool the algorithm. We apply the policy gradient algorithm SAMBA to this setting and show that it is computationally efficient and achieves a state-of-the-art O(K log T / Δ + C / Δ) regret upper bound, where K is the number of arms, C is the unknown corruption level, Δ is the minimum expected reward gap between the best arm and the other arms, and T is the time horizon. Compared with the best existing efficient algorithm (e.g., CBARBAR), SAMBA removes one log T factor from the regret bound while keeping the corruption-dependent term linear in C. This is indeed asymptotically optimal. We also conduct…
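The corrupted-bandit setting described in the abstract can be illustrated with a small simulation. The sketch below is not SAMBA and not the paper's experimental setup: the function and class names, the Bernoulli reward model, and the simple adversary (which zeroes the best arm's reward until a fixed budget is spent) are all assumptions made for illustration. It only shows how a corruption budget and pseudo-regret fit together.

```python
import random


class UniformPolicy:
    """Hypothetical placeholder policy: pulls arms uniformly at random."""

    def __init__(self, n_arms):
        self.n_arms = n_arms

    def select(self, rng):
        return rng.randrange(self.n_arms)

    def update(self, arm, observed):
        # A learning algorithm would use the (possibly corrupted) reward here.
        pass


def run_corrupted_bandit(means, horizon, budget, policy, seed=0):
    """Simulate a Bernoulli bandit with an illustrative adversary.

    Assumed adversary: whenever the best arm pays 1, flip the observed
    reward to 0, as long as the total corruption stays within `budget`.
    Returns (pseudo_regret, corruption_used).
    """
    rng = random.Random(seed)
    best = max(range(len(means)), key=lambda a: means[a])
    regret = 0.0
    used = 0.0
    for _ in range(horizon):
        arm = policy.select(rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        observed = reward
        if arm == best and reward == 1.0 and used + 1.0 <= budget:
            observed = 0.0  # adversary hides the best arm's payoff
            used += 1.0     # each flip costs |observed - reward| = 1
        policy.update(arm, observed)
        # Pseudo-regret is measured against the true (uncorrupted) means.
        regret += means[best] - means[arm]
    return regret, used


means = [0.9, 0.5]
regret, used = run_corrupted_bandit(means, horizon=1000, budget=20,
                                    policy=UniformPolicy(len(means)))
print(f"pseudo-regret: {regret:.1f}, corruption used: {used:.0f}")
```

Swapping `UniformPolicy` for a learning algorithm with the same `select` and `update` methods would let one observe how regret responds as `budget` grows.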
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and ELM · Optimization and Search Problems
