Extended UCB Policies for Multi-armed Bandit Problems
Keqin Liu, Tianshuo Zheng, Zhi-Hua Zhou

TL;DR
This paper extends UCB policies to handle heavy-tailed reward distributions in multi-armed bandit problems, achieving near-optimal regret without prior distribution knowledge, broadening practical applicability.
Contribution
It generalizes existing UCB policies to arbitrary moments, enabling effective handling of heavy-tailed rewards with minimal distribution assumptions.
Findings
Achieves optimal regret growth order $O(\log T)$ for heavy-tailed rewards.
Extends UCB policies to arbitrarily chosen moment orders $p>q>1$, provided the two moments have a known controlled relationship.
Maintains near-optimal regret without prior distribution knowledge.
Abstract
Multi-armed bandit (MAB) problems are widely studied in operations research, stochastic optimization, and reinforcement learning. In this paper, we consider the classical MAB model with heavy-tailed reward distributions and introduce the extended robust UCB policy, which extends the results of Bubeck et al. [5] and Lattimore [22], themselves built on the pioneering idea of UCB policies [e.g. Auer et al. 3]. Previous UCB policies require strict conditions on the reward distributions, which can be difficult to guarantee in practical scenarios. Our extended robust UCB generalizes Lattimore's seminal work (for moments of orders $p=4$ and $q=2$) to arbitrarily chosen $p>q>1$ as long as the two moments have a known controlled relationship, while still achieving the optimal regret growth order $O(\log T)$, thus providing a broadened application area of UCB…
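For orientation, the index rule that UCB policies build on can be sketched with the classical UCB1 policy of Auer et al. This is only a baseline illustration, not the extended robust UCB policy of the paper: it assumes rewards bounded in $[0, 1]$, which is the kind of strict condition the heavy-tailed variants are designed to relax. The names `ucb1` and `make_bernoulli_bandit` and the arm means used below are hypothetical, chosen for this example.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Classical UCB1 baseline (assumes rewards in [0, 1]).

    `pull(arm)` returns one reward sample; the function returns the
    number of times each arm was played over `horizon` rounds.
    """
    counts = [0] * n_arms    # plays per arm
    means = [0.0] * n_arms   # running empirical mean per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialisation: play every arm once.
            arm = t - 1
        else:
            # Play the arm with the largest upper confidence bound:
            # empirical mean plus an exploration bonus that shrinks
            # as the arm is sampled more often.
            arm = max(
                range(n_arms),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts


def make_bernoulli_bandit(probs, seed=0):
    """Hypothetical test environment: arm i pays 1 with probability probs[i]."""
    rng = random.Random(seed)

    def pull(arm):
        return 1.0 if rng.random() < probs[arm] else 0.0

    return pull


counts = ucb1(make_bernoulli_bandit([0.2, 0.8]), n_arms=2, horizon=2000)
print(counts)
```

With a gap this large between the two arms, the suboptimal arm should receive only a small share of the plays, in line with logarithmic regret growth. Robust variants for heavy-tailed rewards typically keep this index structure but replace the empirical mean and the bonus term with estimators suited to heavy tails.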
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Optimization and Search Problems
