Randomized Allocation with Nonparametric Estimation for Contextual Multi-Armed Bandits with Delayed Rewards

Sakshi Arya; Yuhong Yang

arXiv:1902.00819·stat.ML·September 6, 2019·5 cites

Open Access

TL;DR

This paper studies arm selection in contextual multi-armed bandits when rewards are observed with a delay. It proposes a randomized allocation strategy that is strongly consistent under mild assumptions on the delay distributions.

Contribution

It introduces a randomized allocation method with nonparametric estimation for contextual bandits with delayed rewards and shows that the method is strongly consistent.

Findings

01

The proposed strategy is strongly consistent under mild assumptions on the probability distributions of the delays.

02

The method accommodates possible delays between choosing an arm and observing its reward.

03

The setting targets applications in which feedback on a chosen arm arrives with a delay.

Abstract

We study a multi-armed bandit problem with covariates in a setting where there is a possible delay in observing the rewards. Under some mild assumptions on the probability distributions for the delays and using an appropriate randomization to select the arms, the proposed strategy is shown to be strongly consistent.
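
The abstract does not spell out the strategy, so the following Python sketch only illustrates the general idea of randomized allocation with a nonparametric estimate when rewards arrive late. It is not the authors' algorithm. The Nadaraya-Watson estimator, the Gaussian kernel, the decaying exploration probability, the uniform delay model, and every name (`nw_estimate`, `run_delayed_bandit`, `max_delay`, `bandwidth`) are assumptions made for illustration.

```python
import numpy as np


def nw_estimate(x, xs, ys, bandwidth):
    """Nadaraya-Watson estimate at x (Gaussian kernel); 0.0 when no data yet."""
    if len(xs) == 0:
        return 0.0
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    weights = np.exp(-0.5 * ((x - xs) / bandwidth) ** 2)
    total = weights.sum()
    return float(weights @ ys / total) if total > 0 else 0.0


def run_delayed_bandit(mean_fns, horizon, max_delay, bandwidth=0.1, seed=0):
    """Simulate a covariate bandit where each reward is seen after a random delay.

    mean_fns: one function per arm mapping a covariate in [0, 1] to a mean reward.
    Returns (pulls, observed, pending); all three are illustrative bookkeeping.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(mean_fns)
    observed = [([], []) for _ in range(n_arms)]  # per arm: covariates, rewards
    pending = []  # (arrival_time, arm, covariate, reward) not yet visible
    pulls = []

    for t in range(1, horizon + 1):
        # Reveal rewards whose delay has elapsed; keep the rest hidden.
        still_pending = []
        for arrival, arm, x_old, reward in pending:
            if arrival <= t:
                observed[arm][0].append(x_old)
                observed[arm][1].append(reward)
            else:
                still_pending.append((arrival, arm, x_old, reward))
        pending = still_pending

        # Estimate each arm's mean reward from the rewards seen so far.
        x = rng.uniform()
        estimates = [nw_estimate(x, xs, ys, bandwidth) for xs, ys in observed]
        best = int(np.argmax(estimates))

        # Randomize: mostly play the estimated best arm, explore with prob. eps.
        eps = min(1.0, n_arms / np.sqrt(t))  # assumed decaying schedule
        probs = np.full(n_arms, eps / n_arms)
        probs[best] += 1.0 - eps
        arm = int(rng.choice(n_arms, p=probs))

        # The reward is generated now but becomes visible only after the delay.
        reward = mean_fns[arm](x) + rng.normal(scale=0.1)
        delay = int(rng.integers(0, max_delay + 1))
        pending.append((t + delay, arm, x, reward))
        pulls.append((x, arm))

    return pulls, observed, pending
```

A call such as `run_delayed_bandit([lambda x: x, lambda x: 1 - x], horizon=500, max_delay=10)` simulates two arms whose best choice depends on the covariate. The point of the sketch is the bookkeeping: each estimate uses only the rewards that have already arrived, while the randomization keeps every arm sampled.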

Taxonomy

Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Reinforcement Learning in Robotics