Online Learning with Off-Policy Feedback
Germano Gabbianelli, Matteo Papini, Gergely Neu

TL;DR
This paper introduces algorithms for online learning with off-policy feedback, where the learner observes only the rewards obtained by a separate behavior policy rather than its own, and extends the approach to adversarial linear contextual bandits, with theoretical guarantees and experiments.
Contribution
It proposes new algorithms whose regret bounds adapt to the mismatch between the comparator policy and the behavior policy, improving off-policy learning in adversarial bandit settings.
Findings
Regret bounds scale with the mismatch between the comparator policy and the behavior policy.
Performance improves against comparators that are well covered by the observations.
Theoretical guarantees are validated by experiments.
Abstract
We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision making problem, the learner cannot directly observe its rewards, but instead sees the ones obtained by another unknown policy run in parallel (behavior policy). Instead of a standard exploration-exploitation dilemma, the learner has to face another challenge in this setting: due to limited observations outside of their control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy, achieving improved performance against comparators that are well-covered by the observations. We also provide an extension to the setting of…
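The feedback model described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes a finite action set, a fixed behavior policy, and, as a simplification the abstract does not make (it calls the behavior policy unknown), that the behavior probabilities are available to the learner. The function names `simulate_off_policy_feedback` and `importance_weighted_estimates` are hypothetical. The sketch uses standard importance weighting to show why actions rarely played by the behavior policy are harder to evaluate.

```python
import numpy as np


def simulate_off_policy_feedback(rewards, behavior_probs, rng):
    """Return the behavior policy's actions and the rewards it observes.

    rewards: (num_rounds, num_actions) array, one reward per action per round.
    behavior_probs: (num_actions,) probabilities of the behavior policy
        (assumed fixed over time for this illustration).
    """
    num_rounds, num_actions = rewards.shape
    actions = rng.choice(num_actions, size=num_rounds, p=behavior_probs)
    # The learner only ever sees the reward of the behavior policy's action.
    observed = rewards[np.arange(num_rounds), actions]
    return actions, observed


def importance_weighted_estimates(actions, observed, behavior_probs):
    """Standard importance-weighted reward estimates for every action.

    Assumes behavior_probs is known to the learner (an assumption of this
    sketch, not a claim about the paper's setting).
    """
    num_rounds = len(actions)
    estimates = np.zeros((num_rounds, len(behavior_probs)))
    # Only the played action gets a nonzero estimate, scaled by 1 / probability.
    estimates[np.arange(num_rounds), actions] = observed / behavior_probs[actions]
    return estimates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    behavior_probs = np.array([0.7, 0.2, 0.1])  # hypothetical behavior policy
    # Hypothetical reward sequence: constant per action, for easy comparison.
    rewards = np.tile(np.array([0.5, 0.8, 0.3]), (20000, 1))

    actions, observed = simulate_off_policy_feedback(rewards, behavior_probs, rng)
    estimates = importance_weighted_estimates(actions, observed, behavior_probs)

    print("mean estimate per action:    ", estimates.mean(axis=0))
    print("estimate variance per action:", estimates.var(axis=0))
```

In this sketch the per-action estimates are unbiased, but their variance grows as the behavior policy's probability of playing an action shrinks. That is one way to read the abstract's point that the learner cannot estimate the value of every policy equally well, and why guarantees that depend on how well a comparator is covered by the observations are natural here.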
Taxonomy
Topics: Advanced Bandit Algorithms Research · Receptor Mechanisms and Signaling · Reinforcement Learning in Robotics
