Online Learning with Off-Policy Feedback

Germano Gabbianelli; Matteo Papini; Gergely Neu

arXiv:2207.08956 · cs.LG · July 20, 2022

TL;DR

This paper introduces algorithms for online learning with off-policy feedback, addressing partial observability and limited reward information, and extends the approach to adversarial linear contextual bandits with theoretical guarantees and experiments.

Contribution

The paper proposes new algorithms whose regret bounds scale with the mismatch between any comparator policy and the behavior policy, so performance improves against comparators that are well covered by the observations.

Findings

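01. Regret bounds scale with the mismatch between the comparator policy and the behavior policy.

02. The algorithms achieve improved performance against comparators that are well covered by the observations.

03. Theoretical guarantees are validated by experiments.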

Abstract

We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision-making problem, the learner cannot directly observe its rewards, but instead sees the ones obtained by another unknown policy run in parallel (behavior policy). Instead of a standard exploration-exploitation dilemma, the learner has to face another challenge in this setting: due to limited observations outside of its control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy, achieving improved performance against comparators that are well covered by the observations. We also provide an extension to the setting of…
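To make the feedback model concrete, the following is a minimal sketch of one round-by-round interaction, not the algorithm from the paper. It assumes a finite set of K actions, rewards in [0, 1], and, as a simplification for illustration, that the behavior policy's action probabilities are available to the learner for importance weighting. The learner here is a plain exponential-weights rule on importance-weighted reward estimates; the function name, `eta`, and `gamma` are hypothetical. Adding `gamma` to the denominator shrinks the estimates of rarely observed actions, which is one simple way to be cautious about poorly covered actions.

```python
import numpy as np


def simulate_off_policy_feedback(rewards, behavior, eta, gamma, seed=0):
    """Illustrative sketch (assumed setup, not the paper's method).

    rewards:  (T, K) array, reward of each action in each round, in [0, 1].
    behavior: (K,) array, behavior policy probabilities (assumed known here).
    eta:      learning rate of the exponential-weights learner (hypothetical).
    gamma:    bias term added to the importance-weight denominator (hypothetical).
    """
    rng = np.random.default_rng(seed)
    num_rounds, num_actions = rewards.shape
    cumulative_estimates = np.zeros(num_actions)
    learner_reward = 0.0
    policy = np.full(num_actions, 1.0 / num_actions)

    for t in range(num_rounds):
        # Learner's policy: exponential weights on cumulative reward estimates.
        logits = eta * cumulative_estimates
        policy = np.exp(logits - logits.max())
        policy /= policy.sum()

        # The learner earns this expected reward but never observes it.
        learner_reward += float(policy @ rewards[t])

        # Off-policy feedback: only the behavior policy's action and reward are seen.
        observed_action = rng.choice(num_actions, p=behavior)
        observed_reward = rewards[t, observed_action]

        # Importance-weighted estimate, biased downward by gamma.
        cumulative_estimates[observed_action] += observed_reward / (
            behavior[observed_action] + gamma
        )

    return policy, learner_reward


# Example run: action 0 always pays 1, action 1 always pays 0.
example_rewards = np.tile(np.array([1.0, 0.0]), (2000, 1))
final_policy, total = simulate_off_policy_feedback(
    example_rewards, np.array([0.5, 0.5]), eta=0.05, gamma=0.01
)
print(final_policy, total)
```

In this sketch, an action that the behavior policy rarely plays receives few updates, so its value is estimated poorly; this is the coverage issue the abstract describes.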

Taxonomy

Topics: Advanced Bandit Algorithms Research · Receptor Mechanisms and Signaling · Reinforcement Learning in Robotics