Randomized Allocation with Nonparametric Estimation for Contextual Multi-Armed Bandits with Delayed Rewards
Sakshi Arya, Yuhong Yang

TL;DR
This paper addresses the challenge of making optimal arm choices in contextual multi-armed bandits when reward observations are delayed, proposing a randomized strategy that ensures strong consistency under mild assumptions.
Contribution
The paper introduces a randomized allocation method with nonparametric estimation for contextual bandits with delayed rewards, and shows that the method is strongly consistent under mild assumptions on the delay distributions.
Findings
The proposed strategy is strongly consistent under mild assumptions.
Consistency holds even when reward observations are delayed, given mild assumptions on the probability distributions of the delays.
The approach targets settings in which feedback on a chosen arm arrives only after a delay.
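For readers unfamiliar with the term, strong consistency can be stated as follows. This assumes the paper uses the notion that is common in the bandits-with-covariates literature; the symbols (f_i for the mean reward of arm i, X_j for the covariate at round j, I_j for the arm chosen at round j) are introduced here for illustration and do not appear in the abstract.

```latex
% Assumed notation, not taken from the abstract:
%   f_i(x)  mean reward of arm i at covariate x
%   f^*(x)  = \max_i f_i(x), the best achievable mean reward at x
%   I_j     arm chosen at round j, X_j covariate observed at round j
\[
  R_n \;=\; \frac{\sum_{j=1}^{n} f_{I_j}(X_j)}{\sum_{j=1}^{n} f^{*}(X_j)}
  \;\longrightarrow\; 1
  \quad \text{almost surely as } n \to \infty .
\]
```

Under this reading, the strategy's accumulated mean reward is asymptotically equivalent to that of an oracle that always picks the best arm for each covariate.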
Abstract
We study a multi-armed bandit problem with covariates in a setting where there is a possible delay in observing the rewards. Under some mild assumptions on the probability distributions for the delays and using an appropriate randomization to select the arms, the proposed strategy is shown to be strongly consistent.
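The general idea described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: it uses a Nadaraya-Watson kernel estimator as the nonparametric method, a 1/sqrt(t) exploration probability as the randomization, uniformly distributed integer delays, and a one-dimensional covariate. The names `kernel_estimate`, `run_bandit`, `max_delay`, and `bandwidth` are hypothetical.

```python
import math
import random


def kernel_estimate(x, history, bandwidth):
    """Nadaraya-Watson estimate of an arm's mean reward at covariate x.

    history holds (covariate, reward) pairs whose rewards have already arrived.
    Returns 0.0 if no observation carries any weight near x.
    """
    num = 0.0
    den = 0.0
    for ctx, reward in history:
        w = math.exp(-0.5 * ((x - ctx) / bandwidth) ** 2)  # Gaussian kernel
        num += w * reward
        den += w
    return num / den if den > 0 else 0.0


def run_bandit(mean_fns, horizon, max_delay=5, bandwidth=0.1, seed=0):
    """Simulate a contextual bandit whose rewards are observed after a delay."""
    rng = random.Random(seed)
    n_arms = len(mean_fns)
    observed = [[] for _ in range(n_arms)]  # rewards that have arrived, per arm
    pending = []  # (arrival_round, arm, covariate, reward) still in transit
    pulls = []

    for t in range(1, horizon + 1):
        # Deliver every reward whose arrival round has been reached.
        still_pending = []
        for item in pending:
            arrival, arm, ctx, reward = item
            if arrival <= t:
                observed[arm].append((ctx, reward))
            else:
                still_pending.append(item)
        pending = still_pending

        x = rng.random()  # covariate drawn uniformly from [0, 1)

        # Assumed randomization schedule: explore with probability 1/sqrt(t).
        if rng.random() < min(1.0, 1.0 / math.sqrt(t)):
            arm = rng.randrange(n_arms)
        else:
            estimates = [
                kernel_estimate(x, observed[a], bandwidth) for a in range(n_arms)
            ]
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # The reward is generated now but only becomes visible later.
        reward = mean_fns[arm](x) + rng.gauss(0.0, 0.1)
        delay = rng.randint(0, max_delay)  # assumed delay distribution
        pending.append((t + delay, arm, x, reward))
        pulls.append((x, arm))

    return pulls, observed, pending
```

For example, `run_bandit([lambda x: x, lambda x: 1.0 - x], horizon=500)` simulates two arms whose best choice flips at x = 0.5. The estimator only ever sees rewards in `observed`, so decisions at round t are based on an incomplete history; the exploration step keeps every arm sampled while delayed rewards are still in transit.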
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Reinforcement Learning in Robotics
