Offline Planning and Online Learning under Recovering Rewards
David Simchi-Levi, Zeyu Zheng, Feng Zhu

TL;DR
This paper introduces a class of non-stationary multi-armed bandit problems with recovering rewards and proposes periodic policies for the offline and online settings, with near-optimal performance guarantees.
Contribution
The paper develops a unified framework for offline planning and online learning in non-stationary bandits with recovering rewards, including new policies with proven performance bounds.
Findings
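The offline policy achieves a near-optimal approximation ratio of 1 - O(1/√K), where K is the maximum number of arms pulled per period.
The online policy attains regret of Õ(N√T), where N is the number of arms and T is the time horizon.
The framework extends to broader applications with non-stationary, recovering rewards.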
Abstract
Motivated by emerging applications such as live-streaming e-commerce, promotions and recommendations, we introduce and solve a general class of non-stationary multi-armed bandit problems that have the following two features: (i) the decision maker can pull and collect rewards from up to K out of N different arms in each time period; (ii) the expected reward of an arm immediately drops after it is pulled, and then non-parametrically recovers as the arm's idle time increases. With the objective of maximizing the expected cumulative reward over T time periods, we design a class of "Purely Periodic Policies" that jointly set a period to pull each arm. For the proposed policies, we prove performance guarantees for both the offline problem and the online problems. For the offline problem when all model parameters are known, the proposed periodic policy obtains an…
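To make the setting concrete, here is a minimal Python sketch of a periodic pulling schedule under recovering rewards. It is an illustration only, not the paper's algorithm: the exponential recovery curve, the function names `recovery_reward` and `simulate_periodic`, the per-arm offsets, and the assumption that each arm has been idle for one full period before its first pull are all hypothetical choices made for this example. The sketch only checks that at most K arms are pulled in any period and sums the expected rewards collected.

```python
import math


def recovery_reward(cap, rate, idle):
    """Hypothetical recovery curve (an assumption, not from the paper):
    expected reward is nondecreasing in idle time and saturates at `cap`."""
    return cap * (1.0 - math.exp(-rate * idle))


def simulate_periodic(periods, offsets, caps, rates, K, T):
    """Sum expected rewards when arm i is pulled every periods[i] steps,
    starting at time offsets[i], with at most K pulls allowed per step."""
    n = len(periods)
    # Assumption: each arm was last pulled one full period before its first pull.
    last_pull = [offsets[i] - periods[i] for i in range(n)]
    total = 0.0
    for t in range(T):
        # Arms whose periodic schedule fires at time t.
        due = [i for i in range(n)
               if t >= offsets[i] and (t - offsets[i]) % periods[i] == 0]
        if len(due) > K:
            raise ValueError(f"schedule pulls {len(due)} arms at t={t}, but K={K}")
        for i in due:
            idle = t - last_pull[i]  # periods elapsed since the previous pull
            total += recovery_reward(caps[i], rates[i], idle)
            last_pull[i] = t
    return total


if __name__ == "__main__":
    # Three arms, one pull per period; the offsets are chosen so no two arms collide.
    total = simulate_periodic(periods=[2, 4, 4], offsets=[0, 1, 3],
                              caps=[1.0, 2.0, 3.0], rates=[0.5, 0.5, 0.5],
                              K=1, T=8)
    print(f"expected cumulative reward over 8 periods: {total:.3f}")
```

In this sketch a longer period lets an arm recover more before each pull but yields fewer pulls overall, which is the trade-off a periodic schedule has to balance across arms.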
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Age of Information Optimization
