Kiefer Wolfowitz Algorithm is Asymptotically Optimal for a Class of Non-Stationary Bandit Problems
Rahul Singh, Taposh Banerjee

TL;DR
This paper demonstrates that the Kiefer Wolfowitz algorithm, with appropriate modifications, achieves asymptotic optimality in non-stationary bandit problems where the reward functions change slowly over time.
Contribution
It introduces and analyzes two variants of the Kiefer Wolfowitz algorithm for non-stationary bandits, proving their asymptotic efficiency under certain conditions.
Findings
The regret of both algorithms is o(T) when the number of times the reward function changes up to time T is o(T).
With optimally chosen learning rates, both algorithms are asymptotically efficient.
Both algorithms track reward functions that change slowly over time.
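As a reading aid, the o(T) regret statement above can be written out explicitly. The symbols below are assumed for illustration and may differ from the paper's notation: f_t is the reward function at time t, x_t is the action the algorithm chooses, and \mathcal{X} is the action set.

```latex
\[
  R(T) \;=\; \sum_{t=1}^{T} \Big( \max_{x \in \mathcal{X}} f_t(x) \;-\; \mathbb{E}\big[ f_t(x_t) \big] \Big),
  \qquad
  R(T) = o(T) \;\Longleftrightarrow\; \lim_{T \to \infty} \frac{R(T)}{T} = 0 .
\]
```

In words, sublinear regret means the average per-step loss relative to the strategy that knows each function vanishes as the horizon grows.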
Abstract
We consider the problem of designing an allocation rule or an "online learning algorithm" for a class of bandit problems in which the set of control actions available at each time is a convex, compact subset of Euclidean space. Upon choosing an action at a given time, the algorithm obtains a noisy value of the unknown and time-varying function evaluated at that action. The "regret" of an algorithm is the gap between its expected reward and the reward earned by a strategy that knows the function at each time and hence chooses the action that maximizes it. For this non-stationary bandit problem set-up, we consider two variants of the Kiefer Wolfowitz (KW) algorithm: i) KW with a fixed step-size, and ii) KW with a sliding window of fixed length. We show that if the number of times that the function varies during time T is o(T), and if the…
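To make the fixed step-size variant concrete, here is a minimal one-dimensional sketch of a Kiefer Wolfowitz style update: two noisy evaluations give a finite-difference gradient estimate, followed by an ascent step with a constant step size and a projection back onto the action set. It is not the paper's exact algorithm. The function names, the parameters (`step`, `width`), and the test reward `drifting_quadratic` are all hypothetical, and the sliding-window variant is not shown.

```python
import random


def kw_fixed_step(reward, x0, lo, hi, step, width, n_iters, seed=0):
    """One-dimensional Kiefer-Wolfowitz ascent with a constant step size.

    reward(t, x, rng) returns a noisy reward for action x at time t.
    All parameter names are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    x = x0
    path = []
    for n in range(n_iters):
        # Two noisy evaluations per iteration, on either side of x,
        # clipped so both queries stay inside the action set [lo, hi].
        x_plus = min(hi, x + width)
        x_minus = max(lo, x - width)
        y_plus = reward(2 * n, x_plus, rng)
        y_minus = reward(2 * n + 1, x_minus, rng)
        # Finite-difference estimate of the reward gradient.
        grad = (y_plus - y_minus) / (x_plus - x_minus)
        # Constant-step ascent, then projection back onto [lo, hi].
        x = min(hi, max(lo, x + step * grad))
        path.append(x)
    return path


def drifting_quadratic(t, x, rng, switch_at=2000, noise=0.05):
    """Hypothetical reward: a concave quadratic whose maximizer jumps once."""
    peak = 0.2 if t < switch_at else 0.8
    return -(x - peak) ** 2 + rng.gauss(0.0, noise)


path = kw_fixed_step(drifting_quadratic, x0=0.5, lo=0.0, hi=1.0,
                     step=0.05, width=0.1, n_iters=2000)
before = path[999]  # last iterate before the maximizer jumps
after = path[-1]    # final iterate, long after the jump
print(f"before change: {before:.2f}, after change: {after:.2f}")
```

Because the step size does not decay, the iterate keeps moving after the maximizer jumps and settles near the new peak. The price is that the iterate never stops fluctuating around the maximizer, so a larger step tracks changes faster but with more noise.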
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Optimization and Search Problems
