Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret

Lai Wei; Vaibhav Srivastava

arXiv:2101.08980·cs.LG·January 25, 2021·5 cites

Open Access

TL;DR

This paper introduces and analyzes UCB-based algorithms for nonstationary stochastic multi-armed bandit problems, achieving order-optimal minimax regret under variation constraints and handling heavy-tailed rewards.

Contribution

It extends UCB policies with resetting, sliding windows, and discounting to nonstationary environments and develops robust versions for heavy-tailed rewards.
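
To make the sliding-window idea concrete, here is a minimal Python sketch of a generic sliding-window UCB policy. It is an illustration under stated assumptions, not the authors' exact policy: the index is the textbook UCB1-style bonus computed over the window only, and the names `sliding_window_ucb`, `reward_fn`, `window`, and `c` are hypothetical and do not come from the paper.

```python
import math
from collections import deque


def sliding_window_ucb(reward_fn, n_arms, horizon, window, c=2.0):
    """Run a generic sliding-window UCB policy; return the list of arms pulled.

    reward_fn(t, arm) -> float is a hypothetical callback that returns the
    reward observed when `arm` is pulled at time step t.
    Only the most recent observations (at most `window`) enter the statistics.
    """
    history = deque()            # (arm, reward) pairs currently in the window
    counts = [0] * n_arms        # pulls per arm inside the window
    sums = [0.0] * n_arms        # reward totals per arm inside the window
    choices = []

    for t in range(horizon):
        # Forget the oldest observation once the window is full.
        if len(history) == window:
            old_arm, old_reward = history.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward

        # Pull any arm that has no observation left in the window.
        untried = [k for k in range(n_arms) if counts[k] == 0]
        if untried:
            arm = untried[0]
        else:
            n = len(history)     # number of observations in the window

            def index(k):
                # Windowed mean plus a UCB1-style exploration bonus (assumed form).
                return sums[k] / counts[k] + math.sqrt(c * math.log(n) / counts[k])

            arm = max(range(n_arms), key=index)

        reward = reward_fn(t, arm)
        history.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)

    return choices
```

Periodic resetting would instead clear `history`, `counts`, and `sums` every fixed number of steps. Discounting would replace the hard window with geometrically decaying weights on past rewards. In all three cases, shortening the memory speeds up reaction to change at the cost of noisier estimates.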

Findings

01. Proposed policies are order-optimal in worst-case regret.

02. Algorithms effectively adapt to reward distribution changes.

03. Robust methods handle heavy-tailed reward distributions.

Abstract

We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distribution of rewards associated with each arm is assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined as the difference between the expected cumulative reward obtained by an oracle that selects the arm with the maximum mean reward at each time and that obtained by the policy. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We extend Upper-Confidence Bound (UCB)-based policies with three different approaches, namely periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the…
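
The quantities named in the abstract can be written out as follows. This is one common formalization from the nonstationary bandit literature, given here as a sketch. The symbols ($K$ arms, horizon $T$, mean reward $\mu_t^k$ of arm $k$ at time $t$, chosen arm $\varphi_t$, budget $V_T$, admissible set $\mathcal{E}(V_T)$) and the exact form of the variation constraint are assumptions that the abstract itself does not fix.

```latex
% Assumed notation (requires amsmath and amssymb); not taken verbatim from the paper.
\begin{align*}
  % Variation budget: total change in the mean rewards is at most V_T.
  \sum_{t=1}^{T-1} \sup_{k \in \{1,\dots,K\}}
      \bigl| \mu_{t+1}^{k} - \mu_{t}^{k} \bigr| &\le V_T, \\
  % Regret of policy \pi against the oracle that picks the best arm at each time.
  R_T^{\pi} &= \sum_{t=1}^{T} \max_{k} \mu_t^{k}
      - \mathbb{E}^{\pi}\!\left[ \sum_{t=1}^{T} \mu_t^{\varphi_t} \right], \\
  % Worst-case regret: supremum over all reward sequences within the budget.
  R_{\mathrm{worst}}^{\pi}(V_T, T) &= \sup_{\mathcal{F} \in \mathcal{E}(V_T)} R_T^{\pi}.
\end{align*}
```

Under this reading, the minimax regret is the infimum of the worst-case regret over all policies, and a policy is order-optimal when its worst-case regret matches that infimum up to constant factors.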

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization