Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
Kwang-Sung Jun, Chicheng Zhang

TL;DR
This paper introduces CROP, a novel algorithm for structured bandits that achieves asymptotic optimality and adapts to bounded regret, overcoming limitations of optimistic algorithms and improving performance in finite-time regimes.
Contribution
The paper proposes CROP, the first algorithm to attain asymptotic optimality while adapting to bounded regret in structured bandits, eliminating optimistic hypotheses with a pessimistic approach.
Findings
CROP achieves constant-factor asymptotic optimality.
CROP adapts to bounded regret, and its regret bound scales with an effective number of arms rather than the total number of arms.
CROP can outperform existing algorithms exponentially in certain nonasymptotic regimes.
Abstract
We study stochastic structured bandits for minimizing regret. The fact that the popular optimistic algorithms do not achieve the asymptotic instance-dependent regret optimality (asymptotic optimality for short) has recently drawn researchers' attention. On the other hand, it is known that one can achieve bounded regret (i.e., regret that does not grow indefinitely with the time horizon n) in certain instances. Unfortunately, existing asymptotically optimal algorithms rely on forced sampling that introduces an ω(1) term w.r.t. the time horizon n in their regret, failing to adapt to the "easiness" of the instance. In this paper, we focus on the finite hypothesis case and ask if one can achieve the asymptotic optimality while enjoying bounded regret whenever possible. We provide a positive answer by introducing a new algorithm called CRush Optimism with Pessimism (CROP) that eliminates optimistic hypotheses by…
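To make the setting in the abstract concrete, here is a minimal Python sketch of a finite-hypothesis structured bandit and its pseudo-regret. It is not CROP: the hypothesis class `HYPOTHESES`, the Gaussian noise model, and the `greedy_fit_policy` baseline (pull the best arm of the hypothesis that best fits the data so far) are all assumptions made for illustration only.

```python
import random

# Hypothetical finite hypothesis class: each hypothesis assigns a mean
# reward to every arm. One of them is assumed to be the true instance.
HYPOTHESES = [
    [0.9, 0.5, 0.1],
    [0.2, 0.8, 0.1],
    [0.2, 0.5, 0.7],
]


def pseudo_regret(true_means, pulls):
    """Sum of gaps between the best mean and the mean of each pulled arm."""
    best = max(true_means)
    return sum(best - true_means[arm] for arm in pulls)


def greedy_fit_policy(hypotheses, true_index, horizon, noise_sd=0.5, seed=0):
    """Illustrative baseline (NOT CROP): at each step, pick the hypothesis
    with the smallest count-weighted squared error against the empirical
    means, then pull that hypothesis's best arm."""
    rng = random.Random(seed)
    true_means = hypotheses[true_index]
    n_arms = len(true_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    pulls = []

    def loss(h):
        # Arms never pulled contribute nothing to the fit.
        return sum(
            counts[a] * (sums[a] / counts[a] - h[a]) ** 2
            for a in range(n_arms)
            if counts[a] > 0
        )

    for _ in range(horizon):
        best_h = min(hypotheses, key=loss)
        arm = max(range(n_arms), key=lambda a: best_h[a])
        reward = rng.gauss(true_means[arm], noise_sd)  # assumed Gaussian noise
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
    return pulls


if __name__ == "__main__":
    pulls = greedy_fit_policy(HYPOTHESES, true_index=1, horizon=200)
    print("pseudo-regret:", pseudo_regret(HYPOTHESES[1], pulls))
```

In this vocabulary, bounded regret means the printed quantity stays below a constant as `horizon` grows. Whether a naive policy like this one achieves that depends on the instance, which is the kind of "easiness" the abstract refers to.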
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Reinforcement Learning in Robotics
