Stochastic differential equations for limiting description of UCB rule for Gaussian multi-armed bandits
Sergey Garbar

TL;DR
This paper develops a stochastic differential equation framework to describe the limiting behavior of the UCB algorithm in Gaussian multi-armed bandits, validated through Monte Carlo simulations.
Contribution
It introduces a novel stochastic differential equation-based model for the UCB strategy in Gaussian bandits with known horizons, extending understanding of its asymptotic properties.
Findings
The model accurately predicts the normalized regret when the reward distributions are close.
Monte Carlo simulations confirm the validity of the stochastic differential equation description.
The paper estimates the minimal control horizon size at which the normalized regret is not noticeably larger than its maximum possible value.
Abstract
We consider the upper confidence bound (UCB) strategy for Gaussian multi-armed bandits with a known control horizon size and build its limiting description as a system of stochastic differential equations and ordinary differential equations. Rewards for the arms are assumed to have unknown expected values and known variances. To verify the validity of the obtained description, a set of Monte Carlo simulations was performed for the case of close reward distributions, in which mean rewards differ by a magnitude of order N^{-1/2} (N being the control horizon size), as this case yields the highest normalized regret. The minimal size of the control horizon at which the normalized regret is not noticeably larger than the maximum possible was also estimated.
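The Monte Carlo setting described in the abstract can be illustrated with a minimal sketch. It is not the paper's code. It assumes a two-armed bandit, a textbook UCB index (sample mean plus `a * sqrt(variance * ln(n) / n_arm)`), means separated by an amount proportional to `sqrt(variance / horizon)`, and regret normalized by `sqrt(variance * horizon)`. The function names and the parameters `gap_scale` and `a` are hypothetical and chosen only for illustration.

```python
import math
import random


def ucb_normalized_regret(horizon, gap_scale, variance=1.0, a=1.0, seed=0):
    """Run one UCB episode on a two-armed Gaussian bandit; return normalized regret."""
    rng = random.Random(seed)
    sigma = math.sqrt(variance)
    # Assumption: close means, separated by 2 * gap_scale * sqrt(variance / horizon).
    delta = gap_scale * math.sqrt(variance / horizon)
    means = [delta, -delta]
    counts = [0, 0]
    sums = [0.0, 0.0]
    for n in range(1, horizon + 1):
        if n <= 2:
            arm = n - 1  # play each arm once to initialise the estimates
        else:
            # Assumed textbook index: sample mean + a * sqrt(variance * ln(n) / n_arm).
            index = [
                sums[i] / counts[i] + a * math.sqrt(variance * math.log(n) / counts[i])
                for i in range(2)
            ]
            arm = 0 if index[0] >= index[1] else 1
        sums[arm] += rng.gauss(means[arm], sigma)
        counts[arm] += 1
    # Expected loss given the pull counts: each pull of the worse arm costs 2 * delta.
    regret = 2.0 * delta * counts[1]
    return regret / math.sqrt(variance * horizon)


def monte_carlo(horizon, gap_scale, runs=200, **kwargs):
    """Average the normalized regret over independent seeded runs."""
    total = sum(
        ucb_normalized_regret(horizon, gap_scale, seed=s, **kwargs) for s in range(runs)
    )
    return total / runs
```

Calling `monte_carlo(horizon, gap_scale)` over a grid of `gap_scale` values and increasing `horizon` gives a rough picture of how the normalized regret behaves as the horizon grows, which is the kind of comparison the abstract describes.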
Taxonomy
Topics: Advanced Bandit Algorithms Research · Forecasting Techniques and Applications · Distributed Sensor Networks and Detection Algorithms
