Chasing Ghosts: Competing with Stateful Policies
Uriel Feige, Tomer Koren, Moshe Tennenholtz

TL;DR
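This paper studies sequential decision making under bandit feedback when regret is measured against a set of stateful reference policies. Because the decision maker cannot track the policies' internal states, it cannot reliably attribute rewards to any policy; the paper nonetheless gives an algorithm with sublinear regret and proves a regret lower bound.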
Contribution
The paper introduces a novel algorithm for regret minimization in stateful policy settings with bandit feedback, and proves a new regret lower bound.
Findings
The proposed algorithm achieves regret of O(T / log^{1/4} T), which is sublinear in the number of rounds T.
A lower bound on regret of Ω(T / log^{3/2} T) is established.
Addresses the challenge of unobservable internal states of policies.
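To get a feel for how slowly these bounds improve on the trivial linear regret, the short Python sketch below evaluates the two rates from the findings alongside the familiar sqrt(T) rate of the stateless or full-feedback settings. It is illustrative only: constant factors are dropped, the natural logarithm is an assumption, and the function names are hypothetical.

```python
import math


def upper_rate(T: float) -> float:
    """T / log^{1/4} T: rate of the stated regret upper bound (constants dropped)."""
    return T / math.log(T) ** 0.25


def lower_rate(T: float) -> float:
    """T / log^{3/2} T: rate of the stated regret lower bound (constants dropped)."""
    return T / math.log(T) ** 1.5


def sqrt_rate(T: float) -> float:
    """sqrt(T): rate known for stateless policies or full (expert) feedback."""
    return math.sqrt(T)


if __name__ == "__main__":
    # Compare the rates, and the per-round value upper_rate(T) / T, at a few horizons.
    for T in (10**3, 10**6, 10**9):
        print(
            f"T={T:>10}  sqrt={sqrt_rate(T):.3g}  lower={lower_rate(T):.3g}  "
            f"upper={upper_rate(T):.3g}  upper/T={upper_rate(T) / T:.3f}"
        )
```

The per-round value upper_rate(T) / T equals 1 / log^{1/4} T, so it does tend to zero, but only at a rate governed by a power of log T rather than a power of T.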
Abstract
We consider sequential decision making in a setting where regret is measured with respect to a set of stateful reference policies, and feedback is limited to observing the rewards of the actions performed (the so-called "bandit" setting). If either the reference policies are stateless rather than stateful, or the feedback includes the rewards of all actions (the so-called "expert" setting), previous work shows that the optimal regret grows like Θ(√T) in terms of the number of decision rounds T. The difficulty in our setting is that the decision maker unavoidably loses track of the internal states of the reference policies, and thus cannot reliably attribute rewards observed in a certain round to any of the reference policies. In fact, in this setting it is impossible for the algorithm to estimate which policy gives the highest (or even approximately highest) total…
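The state-tracking difficulty described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's formal model: `StickyPolicy` is a hypothetical two-action policy whose internal state changes based on the reward of its own recommended action, rewards are drawn uniformly at random, and the learner picks actions uniformly. Whenever the learner plays a different action from the policy, bandit feedback hides the reward that drives the policy's transition.

```python
import random
from dataclasses import dataclass


@dataclass
class StickyPolicy:
    """Hypothetical stateful policy: keeps its action while rewarded, switches otherwise."""
    action: int = 0  # internal state: the action currently recommended

    def act(self) -> int:
        return self.action

    def update(self, reward_of_own_action: float) -> None:
        # The transition depends on the reward of the policy's OWN action.
        if reward_of_own_action < 0.5:
            self.action = 1 - self.action


def count_untrackable_rounds(T: int, seed: int = 0) -> int:
    """Count rounds in which the learner cannot observe the reward driving the policy's state."""
    rng = random.Random(seed)
    policy = StickyPolicy()
    lost = 0
    for _ in range(T):
        rewards = [rng.random(), rng.random()]  # full reward vector, hidden from the learner
        learner_action = rng.randrange(2)       # assumed uniform learner, for illustration
        if learner_action != policy.act():
            # Bandit feedback reveals only rewards[learner_action], so the learner
            # cannot reproduce the policy's state transition in this round.
            lost += 1
        policy.update(rewards[policy.act()])    # the "ghost" policy evolves regardless
    return lost


if __name__ == "__main__":
    print(count_untrackable_rounds(1000))
```

In this toy setup, a single untrackable round is enough for the learner's picture of the policy's state to become unreliable from then on, which is the attribution problem the abstract points to.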
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Optimization and Search Problems
