A note on the price of bandit feedback for mistake-bounded online learning
Jesse Geneson

TL;DR
This paper examines the cost of bandit feedback in mistake-bounded online learning. It shows that a lemma used in prior work is false and repairs the proof that depended on it, putting the claimed gap between the standard and bandit models on a sound footing.
Contribution
The author identifies a false lemma underlying a previous result of Long (Theoretical Computer Science, 2020) and provides a corrected proof of the mistake bound in the bandit feedback model.
Findings
The lemma from prior work fails when the two vectors are multiples of each other mod p.
A new lemma replaces the false one and repairs the prior proof.
With the corrected proof, the claimed bound on the price of bandit feedback is properly established.
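The first finding can be checked by brute force on a small case. The sketch below assumes the lemma in question asserts that, for u drawn uniformly from {0, ..., p-1}^n and distinct vectors s, t with entries in {1, ..., p-1}, the probability that t·u = z mod p given s·u = z mod p is 1/p. That statement is not quoted in the summary above, so treat it as an assumption. The helper name `conditional_probability` and the specific vectors are illustrative only.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(s, t, z, p):
    """Exact Pr(t.u = z mod p | s.u = z mod p) for u uniform on {0..p-1}^n."""
    n = len(s)
    given = 0  # number of u with s.u = z (mod p)
    both = 0   # number of those u that also satisfy t.u = z (mod p)
    for u in product(range(p), repeat=n):
        if sum(a * b for a, b in zip(s, u)) % p == z:
            given += 1
            if sum(a * b for a, b in zip(t, u)) % p == z:
                both += 1
    return Fraction(both, given)


p = 5
# (2, 2) = 2 * (1, 1) mod 5: the two vectors are multiples of each other.
dependent = [conditional_probability((1, 1), (2, 2), z, p) for z in range(p)]
# (1, 1) and (1, 2) are not multiples of each other mod 5.
independent = [conditional_probability((1, 1), (1, 2), z, p) for z in range(p)]

print(dependent)    # conditional probability is 1 for z = 0 and 0 otherwise
print(independent)  # conditional probability is 1/5 for every z
```

When t is a multiple of s, the value of s·u determines t·u completely, so the conditional probability can only be 0 or 1. When the vectors are not multiples of each other, the pair (s·u, t·u) is uniform over all p² residue pairs, which gives 1/p.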
Abstract
The standard model and the bandit model are two generalizations of the mistake-bound model to online multiclass classification. In both models the learner guesses a classification in each round, but in the standard model the learner receives the correct classification after each guess, while in the bandit model the learner is only told whether or not their guess is correct in each round. For any set $F$ of multiclass classifiers, define $\mathrm{opt}_{\mathrm{std}}(F)$ and $\mathrm{opt}_{\mathrm{bandit}}(F)$ to be the optimal worst-case number of prediction mistakes in the standard and bandit models respectively. Long (Theoretical Computer Science, 2020) claimed that for all $M > 1$ and infinitely many $k$, there exists a set $F$ of functions from a set $X$ to a set $Y$ of size $k$ such that $\mathrm{opt}_{\mathrm{std}}(F) = M$ and $\mathrm{opt}_{\mathrm{bandit}}(F) \ge (1 - o(1))(|Y| \ln |Y|)\,\mathrm{opt}_{\mathrm{std}}(F)$. The proof of this result depended on the…
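The difference between the two feedback models can be illustrated with a toy simulation. This is a minimal sketch, not the paper's construction and not an optimal strategy: the learner is a simple plurality-vote version-space learner, and the function class, the names `count_mistakes`, `functions`, and `target`, and the example data are all hypothetical.

```python
def count_mistakes(functions, sequence, target, bandit):
    """Count mistakes of a plurality-vote version-space learner.

    Each function is a tuple indexed by the input x. With bandit=False the
    learner sees the true label every round; with bandit=True it only learns
    whether its guess was correct.
    """
    version_space = list(functions)
    mistakes = 0
    for x in sequence:
        # Guess the label predicted by the most surviving candidates,
        # breaking ties in favor of the smallest label.
        votes = {}
        for f in version_space:
            votes[f[x]] = votes.get(f[x], 0) + 1
        guess = max(sorted(votes), key=votes.get)
        truth = target[x]
        if guess != truth:
            mistakes += 1
        if bandit and guess != truth:
            # Bandit feedback after a wrong guess: only `guess` is ruled out.
            version_space = [f for f in version_space if f[x] != guess]
        else:
            # Standard feedback, or a correct bandit guess: the label is known.
            version_space = [f for f in version_space if f[x] == truth]
    return mistakes


k = 4
functions = [(y,) for y in range(k)]  # one input point, k possible labels
target = (k - 1,)                     # the true function labels the point k-1
sequence = [0] * k                    # the same point is presented k times

standard_mistakes = count_mistakes(functions, sequence, target, bandit=False)
bandit_mistakes = count_mistakes(functions, sequence, target, bandit=True)
print(standard_mistakes, bandit_mistakes)  # 1 3
```

In this toy case a single round of full feedback identifies the target, while under bandit feedback each wrong guess removes only one candidate label, so the learner can be forced to make k - 1 mistakes.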
