Unified theory of upper confidence bound policies for bandit problems targeting total reward, maximal reward, and more

Nobuaki Kikkawa; Hiroshi Ohno

arXiv:2411.00339 · stat.ML · November 4, 2024


TL;DR

This paper unifies the analysis of UCB policies for the total-reward and max bandit problems, introducing the concept of an oracle quantity and proposing new algorithms with proven order optimality and practical effectiveness.

Contribution

It provides a unified theoretical framework for UCB policies across different bandit problems, introduces the oracle quantity concept, and proposes PIUCB algorithms for improved performance.

Findings

1. The MaxSearch algorithm is order-optimal for the max bandit problem.

2. Confidence intervals of the oracle quantity are crucial for the optimality of UCB policies.

3. PIUCB algorithms perform comparably to or better than MaxSearch in experiments.

Abstract

The upper confidence bound (UCB) policy is recognized as an order-optimal solution for the classical total-reward bandit problem. While similar UCB-based approaches have been applied to the max bandit problem, which aims to maximize the cumulative maximal reward, their order optimality remains unclear. In this study, we clarify the unified conditions under which the UCB policy achieves order optimality in both the total-reward and max bandit problems. A key concept of our theory is the oracle quantity, which identifies the best arm by its highest value. This allows a unified definition of the UCB policy as pulling the arm with the highest UCB of the oracle quantity. Additionally, under this setting, optimality analysis can be conducted by replacing traditional regret with the number of failures as a core measure. One consequence of our analysis is that the confidence interval of the…
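The unified policy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the policy is written against a pluggable bound function (the names `oracle_ucb`, `mean_ucb`, `run_ucb` and `bernoulli` are hypothetical), and it is instantiated only for the total-reward case, where the oracle quantity is the expected reward and the bound is assumed to be the standard UCB1 index for rewards in [0, 1]. Following the abstract, it reports the number of failures (pulls of any arm other than the best one) instead of regret. The oracle quantity and confidence interval the paper uses for the max bandit problem are not reproduced here.

```python
import math
import random
from typing import Callable, List, Sequence

# Hypothetical interface: maps (samples seen for one arm, current round t)
# to an upper confidence bound on that arm's oracle quantity.
OracleUCB = Callable[[Sequence[float], int], float]


def mean_ucb(samples: Sequence[float], t: int) -> float:
    """UCB1-style bound on the mean reward, assuming rewards in [0, 1].

    The mean is the oracle quantity for the total-reward problem.
    """
    n = len(samples)
    return sum(samples) / n + math.sqrt(2.0 * math.log(t) / n)


def bernoulli(p: float) -> Callable[[random.Random], float]:
    """Hypothetical test arm returning 1.0 with probability p, else 0.0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


def run_ucb(
    arms: List[Callable[[random.Random], float]],
    oracle_ucb: OracleUCB,
    best_arm: int,
    horizon: int,
    seed: int = 0,
) -> int:
    """Pull the arm with the highest UCB of the oracle quantity.

    Returns the number of failures, i.e. rounds in which an arm other
    than best_arm was pulled.
    """
    rng = random.Random(seed)
    history: List[List[float]] = [[] for _ in arms]
    failures = 0
    for t in range(1, horizon + 1):
        if t <= len(arms):
            arm = t - 1  # initialisation: pull every arm once
        else:
            arm = max(range(len(arms)), key=lambda k: oracle_ucb(history[k], t))
        history[arm].append(arms[arm](rng))
        if arm != best_arm:
            failures += 1
    return failures


if __name__ == "__main__":
    # Two Bernoulli arms; arm 1 (mean 0.7) is the best arm.
    print(run_ucb([bernoulli(0.3), bernoulli(0.7)], mean_ucb, best_arm=1, horizon=2000))
```

Passing a different `oracle_ucb` changes which quantity the policy targets without touching the selection loop, which is the sense in which the abstract calls the definition unified.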


Taxonomy

Topics: Advanced Bandit Algorithms Research · Advanced Causal Inference Techniques