On the Complexity of Bandit Linear Optimization
Ohad Shamir

TL;DR
This paper studies the attainable regret in bandit linear optimization and shows that it can exceed the full-information regret by a factor as large as the dimension d, disproving a well-known conjecture that the gap is smaller.
Contribution
It shows that the regret in bandit linear optimization can be as large as d times the full-information regret, and that this gap already arises under simple modifications of standard domains that have no effect in the full-information setting.
Findings
Regret can be as large as d times the full-information regret.
Simple domain modifications can drastically increase bandit regret.
Contrasts between bandit and full-information settings are highlighted.
Abstract
We study the attainable regret for online linear optimization problems with bandit feedback, where unlike the full-information setting, the player can only observe its own loss rather than the full loss vector. We show that the price of bandit information in this setting can be as large as d, disproving the well-known conjecture that the regret for bandit linear optimization is at most √d times the full-information regret. Surprisingly, this is shown using "trivial" modifications of standard domains, which have no effect in the full-information setting. This and other results we present highlight some interesting differences between full-information and bandit learning, which were not considered in previous literature.
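To make the two feedback models concrete, here is a minimal Python sketch of a single round and of the regret computation. It is an illustration only and not code from the paper: the function names `play_round` and `regret`, and the use of a small finite domain, are assumptions chosen for readability.

```python
def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def play_round(action, loss_vector, bandit):
    """Return (incurred loss, feedback shown to the player) for one round.

    Hypothetical helper: in the bandit setting the player sees only the
    scalar loss of its own action; with full information it sees the
    entire loss vector.
    """
    incurred = dot(action, loss_vector)
    feedback = incurred if bandit else list(loss_vector)
    return incurred, feedback


def regret(actions, loss_vectors, domain):
    """Cumulative loss of the played actions minus the cumulative loss
    of the best fixed point in a finite `domain` (an assumption made
    here so the minimum can be taken by enumeration)."""
    played = sum(dot(x, l) for x, l in zip(actions, loss_vectors))
    best_fixed = min(sum(dot(x, l) for l in loss_vectors) for x in domain)
    return played - best_fixed
```

For example, with the domain `[(1, 0), (0, 1)]`, playing `(1, 0)` twice against the loss vector `(1, 0)` gives a regret of 2, since the fixed action `(0, 1)` would have incurred no loss. The regret definition is the same in both settings; only the feedback returned by `play_round` differs.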
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Reinforcement Learning in Robotics
