Online learning in bandits with predicted context
Yongyi Guo, Ziping Xu, Susan Murphy

TL;DR
This paper introduces an online algorithm for contextual bandits in which the agent observes only a noisy prediction of the context, together with the error variance (or an estimate of it). The algorithm achieves sublinear regret in a setting where classical bandit algorithms do not.
Contribution
It presents the first online algorithm with sublinear regret guarantees, under mild conditions, for bandits with noisy, predicted contexts. It does this by extending the measurement error model from classical statistics to online decision-making.
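The classical measurement error model behind this contribution is sketched below. The sketch assumes a linear reward in the true context, context noise that is independent of the context and the reward noise, and actions chosen independently of the noisy context. Under those assumptions, least squares on the predicted context converges to a shrunken ("attenuated") parameter. Subtracting the known noise covariance from the Gram matrix removes that bias. This is the textbook errors-in-variables correction, not necessarily the paper's exact estimator, which must also handle a policy that depends on the noisy context.

```latex
\begin{aligned}
&\tilde{x}_t = x_t + \varepsilon_t,\qquad \mathbb{E}[\varepsilon_t] = 0,\qquad \operatorname{Cov}(\varepsilon_t) = \Sigma_e,\qquad r_t = x_t^\top \theta_{a_t} + \eta_t,\\[4pt]
&\hat\theta_a^{\mathrm{naive}} = \Big(\sum_{t:\,a_t=a} \tilde{x}_t \tilde{x}_t^\top\Big)^{-1} \sum_{t:\,a_t=a} \tilde{x}_t r_t
\;\longrightarrow\; \big(\mathbb{E}[x x^\top] + \Sigma_e\big)^{-1} \mathbb{E}[x x^\top]\,\theta_a \;\neq\; \theta_a,\\[4pt]
&\hat\theta_a^{\mathrm{corr}} = \Big(\sum_{t:\,a_t=a} \tilde{x}_t \tilde{x}_t^\top - n_a \Sigma_e\Big)^{-1} \sum_{t:\,a_t=a} \tilde{x}_t r_t
\;\longrightarrow\; \theta_a,
\end{aligned}
```

Here $n_a$ is the number of rounds in which arm $a$ was chosen. The naive limit follows from $\mathbb{E}[\tilde{x}\tilde{x}^\top] = \mathbb{E}[xx^\top] + \Sigma_e$ and $\mathbb{E}[\tilde{x}\, r] = \mathbb{E}[xx^\top]\theta_a$.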
Findings
The algorithm achieves sublinear regret when the context error is non-vanishing.
Demonstrates effectiveness on synthetic and real datasets.
Extends classical measurement error models to online learning.
Abstract
We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context and the error variance (or an estimator of this variance). This setting is motivated by a wide range of applications where the true context for decision-making is unobserved, and only a prediction of the context by a potentially complex machine learning algorithm is available. When the context error is non-vanishing, classical bandit algorithms fail to achieve sublinear regret. We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions. The key idea is to extend the measurement error model in classical statistics to the online decision-making setting, which is nontrivial due to the policy being dependent on the noisy context observations. We further demonstrate the benefits of the proposed approach in simulation…
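The abstract claims that estimating on predicted contexts without correction goes wrong, and that a measurement-error correction helps. A small simulation can illustrate this. The code below is a sketch under stated assumptions, not the authors' algorithm. It assumes:

- a two-arm linear bandit whose reward depends on the true context;
- Gaussian context noise with known covariance `sigma_e`;
- an ε-greedy policy (exploration rate `eps`) that updates its per-arm estimates only from uniformly random exploration rounds.

Restricting updates to those rounds sidesteps, rather than solves, the policy-dependence problem the abstract describes. The function names `corrected_ls` and `run`, the true parameters `theta`, and all numeric settings are hypothetical.

```python
import numpy as np


def corrected_ls(S_xx, S_xr, n, sigma_e):
    # Method-of-moments errors-in-variables correction: since the noise is
    # independent of the true context, E[x~ x~^T] = E[x x^T] + Sigma_e, so
    # subtracting n * Sigma_e removes the noise contribution to the Gram matrix.
    return np.linalg.solve(S_xx - n * sigma_e, S_xr)


def run(T=20_000, eps=0.2, seed=0):
    rng = np.random.default_rng(seed)
    d, K = 2, 2
    theta = np.array([[1.0, -1.0], [-1.0, 1.0]])  # hypothetical true arm parameters, shape (K, d)
    sigma_e = 0.5 * np.eye(d)                     # assumed known context-noise covariance
    chol = np.linalg.cholesky(sigma_e)
    S_xx = np.zeros((K, d, d))  # per-arm sum of x~ x~^T over exploration rounds
    S_xr = np.zeros((K, d))     # per-arm sum of x~ * reward over exploration rounds
    n = np.zeros(K, dtype=int)  # per-arm count of exploration rounds
    total_reward = 0.0
    for _ in range(T):
        x = rng.normal(size=d)                   # true context, never shown to the agent
        x_tilde = x + chol @ rng.normal(size=d)  # predicted (noisy) context the agent sees
        # Explore during warm-up or with probability eps; this decision does not
        # depend on the current x_tilde.
        explore = n.min() < 2 * d or rng.random() < eps
        if explore:
            a = int(rng.integers(K))  # uniform action, independent of x_tilde
        else:
            est = np.array([corrected_ls(S_xx[k], S_xr[k], n[k], sigma_e) for k in range(K)])
            a = int(np.argmax(est @ x_tilde))    # act greedily on the noisy context
        r = x @ theta[a] + 0.1 * rng.normal()    # reward depends on the TRUE context
        total_reward += r
        if explore:
            # Only exploration rounds update the statistics, so each arm's sample
            # of x_tilde is not selected by the policy.
            S_xx[a] += np.outer(x_tilde, x_tilde)
            S_xr[a] += x_tilde * r
            n[a] += 1
    naive = np.array([np.linalg.solve(S_xx[k], S_xr[k]) for k in range(K)])
    corrected = np.array([corrected_ls(S_xx[k], S_xr[k], n[k], sigma_e) for k in range(K)])
    return {
        "naive_err": float(np.linalg.norm(naive - theta)),
        "corrected_err": float(np.linalg.norm(corrected - theta)),
        "avg_reward": total_reward / T,
    }


if __name__ == "__main__":
    print(run())
```

With these settings the naive estimate shrinks each arm's parameter toward roughly `theta / 1.5`, because the noisy context has variance 1.5 instead of 1. The corrected estimate stays close to `theta`. The paper's contribution is the harder case, in which the data used for estimation is collected by a policy that depends on the noisy context. This sketch does not attempt that case.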
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Reinforcement Learning in Robotics
