Bandit Allocational Instability
Yilun Chen, Jiaqi Lu

TL;DR
This paper introduces allocation variability, a new performance metric for multi-armed bandit algorithms, shows that it trades off fundamentally against regret, and provides bounds together with an algorithm that achieves near-optimal performance on both.
Contribution
It establishes a fundamental trade-off between allocation variability and regret in bandit algorithms, introduces a tunable algorithm UCB-f, and resolves an open question in the field.
Findings
Worst-case regret R_T and worst-case allocation variability S_T satisfy R_T · S_T = Ω(T^{3/2})
Any algorithm with sublinear worst-case regret must have S_T = ω(√T)
UCB-f achieves the Pareto optimal trade-off
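The second finding follows from the first by a short calculation, sketched here. The symbols R_T and S_T are those used in the findings. The exponent a is introduced only for illustration, and the Θ(√T) rate is the standard minimax regret order in T for a fixed number of arms.

```latex
% Rearranging the trade-off bound:
R_T \cdot S_T = \Omega\!\left(T^{3/2}\right)
\;\Longrightarrow\;
S_T = \Omega\!\left(\frac{T^{3/2}}{R_T}\right).

% Minimax regret-optimal algorithms:
R_T = \Theta\!\left(\sqrt{T}\right)
\;\Longrightarrow\;
S_T = \Omega(T).

% Any polynomially sublinear regret, R_T = O(T^{a}) with a < 1:
S_T = \Omega\!\left(T^{3/2 - a}\right) = \omega\!\left(\sqrt{T}\right).

% In general, R_T = o(T) gives T / R_T \to \infty, hence
\frac{T^{3/2}}{R_T} = \sqrt{T}\cdot\frac{T}{R_T} = \omega\!\left(\sqrt{T}\right).
```

Since an arm is pulled at most T times, the standard deviation of its pull count is at most T/2, so Ω(T) is the largest order S_T can take.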
Abstract
When multi-armed bandit (MAB) algorithms allocate pulls among competing arms, the resulting allocation can exhibit huge variation. This is particularly harmful in modern applications such as learning-enhanced platform operations and post-bandit statistical inference. Thus motivated, we introduce a new performance metric of MAB algorithms termed allocation variability, which is the largest (over arms) standard deviation of an arm's number of pulls. We establish a fundamental trade-off between allocation variability and regret, the canonical performance metric of reward maximization. In particular, for any algorithm, the worst-case regret R_T and worst-case allocation variability S_T must satisfy R_T · S_T = Ω(T^{3/2}) as T → ∞, as long as R_T = o(T). This indicates that any minimax regret-optimal algorithm must incur worst-case allocation variability…
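The definition of allocation variability in the abstract can be made concrete with a small Monte Carlo sketch. This is an illustration under stated assumptions, not the paper's method: it uses the classical UCB1 index rather than UCB-f, Bernoulli rewards, and a single fixed instance, so it estimates the metric on that instance only and not the worst case over instances. The function names `ucb1_pull_counts` and `allocation_variability` are hypothetical.

```python
import math
import random
import statistics


def ucb1_pull_counts(means, horizon, rng):
    """Run classical UCB1 on Bernoulli arms and return pulls per arm.

    Assumption: UCB1 stands in for a generic regret-minimizing policy;
    it is not the UCB-f algorithm from the paper.
    """
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize the indices
        else:
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


def allocation_variability(means, horizon, runs, seed=0):
    """Estimate the largest (over arms) std. dev. of an arm's pull count."""
    rng = random.Random(seed)
    samples = [ucb1_pull_counts(means, horizon, rng) for _ in range(runs)]
    per_arm_std = [
        statistics.pstdev([run[a] for run in samples])
        for a in range(len(means))
    ]
    return max(per_arm_std)


if __name__ == "__main__":
    # Two arms with equal means: an illustrative instance in which
    # neither arm is ever clearly ruled out.
    print(allocation_variability([0.5, 0.5], horizon=2000, runs=200))
```

With two arms the pull counts sum to the horizon, so both arms have the same standard deviation and the maximum over arms is just that common value. Repeating the estimate for several horizons gives a rough empirical view of how the variability grows with T on the chosen instance.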
Taxonomy
Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms
