Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret

Raman Arora (TTIC); Ofer Dekel (Microsoft Research); Ambuj Tewari (University of Texas)

arXiv:1206.6400 · cs.LG · July 3, 2012 · ICML · 87 cites

TL;DR

This paper introduces policy regret as a performance measure for online learning against adaptive adversaries. It shows that sublinear policy regret cannot be guaranteed when the adversary has unbounded memory, and it proposes a transformation technique that achieves sublinear policy regret when the adversary's memory is bounded.

Contribution

The paper defines policy regret as a more meaningful performance measure against adaptive adversaries and provides a method for converting existing regret guarantees into policy regret guarantees when the adversary has bounded memory.
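
The difference between the two notions can be written out explicitly. The notation below is an assumption chosen for illustration (the page does not give formulas): $x_1, \dots, x_T$ are the player's actions from an action set $\mathcal{A}$, and $f_t$ is the adaptive adversary's loss function on round $t$, which may depend on the whole action history. In this notation, standard regret changes only the current action in the comparator, while policy regret re-evaluates the entire history as if the fixed action had been played throughout:

```latex
% Standard regret: the comparator replaces only the action on round t,
% keeping the past actions x_1, ..., x_{t-1} as actually played.
\[
  R_T \;=\; \sum_{t=1}^{T} f_t(x_1, \dots, x_t)
  \;-\; \min_{x \in \mathcal{A}} \sum_{t=1}^{T} f_t(x_1, \dots, x_{t-1}, x)
\]

% Policy regret: the comparator plays the same action x on every round,
% so the adversary's reaction to that constant sequence is accounted for.
\[
  P_T \;=\; \sum_{t=1}^{T} f_t(x_1, \dots, x_t)
  \;-\; \min_{x \in \mathcal{A}} \sum_{t=1}^{T} f_t(\underbrace{x, \dots, x}_{t \text{ times}})
\]
```

Against an oblivious adversary, where each $f_t$ depends only on its last argument, the two expressions coincide.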

Findings

01. No bandit algorithm can guarantee sublinear policy regret against adaptive adversaries with unbounded memory.

02. A general technique converts any bandit algorithm with a sublinear regret bound into one with sublinear policy regret against adversaries with bounded memory.

03. The results extend to other variants of regret: switching regret, internal regret, and swap regret.
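
The conversion in the second finding can be sketched as a mini-batching wrapper: the base bandit algorithm picks an action once per batch, that action is repeated for the whole batch, and the average batch loss is fed back as a single observation. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The `Exp3` class, the toy adversary `one_memory_loss`, the learning rate, and the batch size are all hypothetical choices made for this example.

```python
import math
import random


class Exp3:
    """Textbook Exp3 for losses in [0, 1]; a stand-in for any bandit algorithm."""

    def __init__(self, num_actions, eta, rng):
        self.k = num_actions
        self.eta = eta
        self.rng = rng
        self.cum_est = [0.0] * num_actions  # importance-weighted loss estimates
        self.probs = [1.0 / num_actions] * num_actions

    def select(self):
        # Subtract the minimum before exponentiating for numerical stability.
        low = min(self.cum_est)
        weights = [math.exp(-self.eta * (c - low)) for c in self.cum_est]
        total = sum(weights)
        self.probs = [w / total for w in weights]
        return self.rng.choices(range(self.k), weights=self.probs)[0]

    def update(self, action, loss):
        # Unbiased estimate of the loss vector from a single observed entry.
        self.cum_est[action] += loss / self.probs[action]


def one_memory_loss(history):
    """Hypothetical adversary with memory 1: loss is 1 if the previous action was 0."""
    if len(history) < 2:
        return 0.0
    return 1.0 if history[-2] == 0 else 0.0


def run_minibatched(base, loss_fn, horizon, batch_size):
    """Repeat each action chosen by `base` for a whole batch and feed back
    the average loss observed over that batch."""
    history, total = [], 0.0
    for start in range(0, horizon, batch_size):
        action = base.select()
        batch_losses = []
        for _ in range(min(batch_size, horizon - start)):
            history.append(action)
            batch_losses.append(loss_fn(history))
        total += sum(batch_losses)
        base.update(action, sum(batch_losses) / len(batch_losses))
    return history, total


horizon, batch_size = 3000, 15  # arbitrary illustrative values
base = Exp3(num_actions=2, eta=0.1, rng=random.Random(0))
history, total = run_minibatched(base, one_memory_loss, horizon, batch_size)
print(f"total loss over {horizon} rounds: {total}")
```

Within a batch, only the first round's loss depends on an action from the previous batch, so with a memory-1 adversary most of the averaged feedback reflects the repeated action alone. Choosing the batch size trades this contamination against the smaller number of updates the base algorithm receives.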

Abstract

Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely-accepted formal definition of an online algorithm's ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm's actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm's performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee a sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary's memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy…
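
A toy calculation illustrates the abstract's claim that standard regret becomes inadequate against an adaptive adversary. The adversary below is a hypothetical example constructed for this illustration: its loss on each round depends only on the player's previous action. The function names are likewise assumptions. A player who always picks action 0 suffers loss on almost every round, yet its standard regret is zero, while its policy regret grows linearly.

```python
from typing import Sequence


def loss(history: Sequence[int]) -> float:
    """Loss f_t(x_1..x_t): 1 if the PREVIOUS action was 0, else 0 (hypothetical adversary)."""
    if len(history) < 2:
        return 0.0
    return 1.0 if history[-2] == 0 else 0.0


def cumulative_loss(actions: Sequence[int]) -> float:
    return sum(loss(actions[: t + 1]) for t in range(len(actions)))


def standard_regret(actions: Sequence[int], num_actions: int) -> float:
    # Comparator swaps in x on round t only; past actions stay as played.
    best = min(
        sum(loss(list(actions[:t]) + [x]) for t in range(len(actions)))
        for x in range(num_actions)
    )
    return cumulative_loss(actions) - best


def policy_regret(actions: Sequence[int], num_actions: int) -> float:
    # Comparator plays x on every round, so the adversary reacts to x itself.
    best = min(
        sum(loss([x] * (t + 1)) for t in range(len(actions)))
        for x in range(num_actions)
    )
    return cumulative_loss(actions) - best


always_zero = [0] * 100
print(standard_regret(always_zero, 2))  # 0.0
print(policy_regret(always_zero, 2))    # 99.0
```

The standard-regret comparator inherits the player's own history, so every alternative action on round t is charged the same loss and the difference vanishes. The policy-regret comparator replays the whole game with action 1, which this adversary never penalizes.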

Taxonomy

Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics