Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret
Orestis Papadigenopoulos, Constantine Caramanis, Sanjay Shakkottai

TL;DR
This paper introduces an improved polynomial-time approximation algorithm for planning in non-stationary multi-armed bandits with recharging payoffs, achieving sublinear regret in the unknown payoff setting.
Contribution
It develops a $(1 - 1/e)$-approximation algorithm for known payoff functions and extends it to the bandit setting with sublinear regret, improving prior guarantees.
Findings
Achieves a $(1 - 1/e)$-approximation for the planning problem.
Provides a bandit algorithm with sublinear regret.
Improves upon the previous best $1/4$-approximation guarantee.
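When the planning benchmark can itself only be approximated, one standard way to make "sublinear regret" precise is the $\alpha$-regret below, with $\alpha = 1 - 1/e$. The notation $\mathrm{OPT}(T)$ (best expected total payoff over $T$ rounds with known payoff functions) and $r_t$ (payoff collected in round $t$) is assumed for this sketch and is not quoted from the paper.

$$
\mathrm{Reg}_{\alpha}(T) \;=\; \alpha \cdot \mathrm{OPT}(T) \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big], \qquad \alpha = 1 - \frac{1}{e} \approx 0.632 .
$$

Under this formalization, the regret is sublinear when $\mathrm{Reg}_{\alpha}(T)/T \to 0$ as $T \to \infty$. The approximation ratio rises from $1/4 = 0.25$ to roughly $0.632$.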
Abstract
The stochastic multi-armed bandit setting has recently been studied in the non-stationary regime, where the mean payoff of each action is a non-decreasing function of the number of rounds passed since it was last played. This model captures natural behavioral aspects of users that crucially determine the performance of recommendation platforms, ad placement systems, and more. Even assuming prior knowledge of the mean payoff functions, computing an optimal plan in this model is NP-hard, while the state of the art is a $1/4$-approximation algorithm for the case where at most one arm can be played per round. We first focus on the setting where the mean payoff functions are known. In this setting, we significantly improve the best-known guarantees for the planning problem by developing a polynomial-time $(1 - 1/e)$-approximation algorithm (asymptotically and in expectation),…
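To make the known-payoff planning problem concrete, here is a minimal Python sketch under assumptions that are not taken from the paper: each arm's mean payoff curve is a finite list indexed by delay (rounds since its last play), delays past the end of the list reuse the last value, every arm starts fully recharged, and exactly one arm is played per round. Because payoffs are stochastic, the expected total reward of a fixed schedule is, by linearity of expectation, the sum of the mean payoffs. The names `mean_payoff`, `schedule_value`, and `best_schedule` are hypothetical. The exhaustive search enumerates all $k^T$ schedules, so it only gives a reference optimum on toy instances. It is not the paper's $(1 - 1/e)$-approximation algorithm, which runs in polynomial time.

```python
from itertools import product
from typing import Sequence, Tuple

# Hypothetical representation: curves[i][d - 1] is the mean payoff of arm i
# when it is played d >= 1 rounds after its previous play.
Curves = Sequence[Sequence[float]]


def mean_payoff(curve: Sequence[float], delay: int) -> float:
    """Non-decreasing recharging curve; delays past its end reuse the last value."""
    return curve[min(delay, len(curve)) - 1]


def schedule_value(curves: Curves, schedule: Sequence[int]) -> float:
    """Expected total payoff of a fixed schedule that plays one arm per round.

    Assumption: every arm starts fully recharged, i.e. its first play
    earns the last value of its curve.
    """
    start_delay = max(len(c) for c in curves)
    last_played = [-start_delay] * len(curves)
    total = 0.0
    for t, arm in enumerate(schedule):
        total += mean_payoff(curves[arm], t - last_played[arm])
        last_played[arm] = t
    return total


def best_schedule(curves: Curves, horizon: int) -> Tuple[float, Tuple[int, ...]]:
    """Exhaustive search over all k**horizon schedules (toy instances only)."""
    best_value, best_plan = float("-inf"), ()
    for plan in product(range(len(curves)), repeat=horizon):
        value = schedule_value(curves, plan)
        if value > best_value:
            best_value, best_plan = value, plan
    return best_value, best_plan


# Toy instance with made-up numbers: arm 0 recharges slowly up to 1.0,
# arm 1 recharges quickly up to 0.5.
curves = [[0.2, 0.6, 1.0], [0.3, 0.5]]
value, plan = best_schedule(curves, horizon=4)
print(plan, round(value, 3))  # prints (0, 1, 1, 0) 2.8
print(round(schedule_value(curves, [0, 0, 0, 0]), 3))  # prints 1.6
```

On this toy instance the optimal plan rests arm 0 for two rounds so that it is fully recharged when played again, whereas replaying arm 0 every round earns only 1.6.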
Taxonomy
Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Age of Information Optimization
