Stochastic Bandit Based on Empirical Moments
Junya Honda, Akimichi Takemura

TL;DR
This paper proposes a stochastic bandit policy that uses the first d empirical moments of each arm's rewards, with d fixed in advance. As d increases, the asymptotic upper bound on the policy's regret approaches the theoretical bound of Burnetas and Katehakis, so the choice of d trades computational cost against regret.
Contribution
The paper extends existing variance-based policies, which use empirical moments up to the second order, to a policy that uses the first d empirical moments, so that the choice of d trades computational complexity against expected regret.
Findings
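The asymptotic upper bound on the regret approaches the theoretical bound of Burnetas and Katehakis as the number of moments d increases.
Choosing d appropriately realizes a tradeoff between computational complexity and expected regret.
The policy generalizes variance-based methods from the second empirical moment to the first d empirical moments.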
Abstract
In the multiarmed bandit problem a gambler chooses an arm of a slot machine to pull considering a tradeoff between exploration and exploitation. We study the stochastic bandit problem where each arm has a reward distribution supported in a known bounded interval, e.g. [0,1]. For this model, policies which take into account the empirical variances (i.e. second moments) of the arms are known to perform effectively. In this paper, we generalize this idea and we propose a policy which exploits the first d empirical moments for arbitrary d fixed in advance. The asymptotic upper bound of the regret of the policy approaches the theoretical bound by Burnetas and Katehakis as d increases. By choosing appropriate d, the proposed policy realizes a tradeoff between the computational complexity and the expected regret.
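The abstract does not spell out how the policy turns empirical moments into an arm-selection rule, so the following is only a minimal sketch of the bookkeeping such a policy would need: maintaining the first d raw empirical moments of one arm's rewards, assuming rewards lie in [0,1]. The class name MomentTracker and its methods are hypothetical and are not taken from the paper.

```python
class MomentTracker:
    """Running raw empirical moments (1/n) * sum(x**k), k = 1..d, for one arm.

    Hypothetical helper for illustration; not the authors' implementation.
    Assumes rewards are supported in [0, 1].
    """

    def __init__(self, d):
        if d < 1:
            raise ValueError("d must be at least 1")
        self.d = d
        self.n = 0                 # number of pulls observed so far
        self.sums = [0.0] * d      # sums[k-1] holds the sum of x**k

    def update(self, x):
        """Record one reward x; costs O(d) time per pull."""
        if not 0.0 <= x <= 1.0:
            raise ValueError("reward must lie in [0, 1]")
        self.n += 1
        power = 1.0
        for k in range(self.d):
            power *= x             # power == x**(k+1)
            self.sums[k] += power

    def moments(self):
        """Return the list of the first d empirical moments."""
        if self.n == 0:
            raise ValueError("no rewards observed yet")
        return [s / self.n for s in self.sums]


# Usage: two pulls with rewards 0.5 and 1.0, tracking d = 2 moments.
tracker = MomentTracker(d=2)
tracker.update(0.5)
tracker.update(1.0)
first, second = tracker.moments()
variance = second - first ** 2     # empirical variance from raw moments
```

Each update costs O(d) time and the tracker stores O(d) numbers per arm, which is one concrete way a larger d raises the per-pull cost. The cost of computing the policy's index from these moments is not shown here and may dominate.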
Taxonomy
Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Reinforcement Learning in Robotics
