Reward Imputation with Sketching for Contextual Batched Bandits
Xiao Zhang, Ninglu Shao, Zihua Si, Jun Xu, Wenhan Wang, Hanjing Su, Ji-Rong Wen

TL;DR
This paper introduces SPUIR, a sketching-based reward imputation method for contextual batched bandits (CBB) that makes fuller use of feedback and achieves lower regret than baseline methods on synthetic and real datasets.
Contribution
The paper proposes a novel sketching-based reward imputation approach for CBB, with theoretical regret guarantees and practical extensions for nonlinear rewards.
Findings
SPUIR achieves lower regret than baseline methods.
The approach is effective on synthetic and real datasets.
Extensions, including one for nonlinear rewards, make the approach more practical and more widely applicable.
Abstract
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode, but the rewards of the non-executed actions are unobserved, resulting in partial-information feedback. Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information. In this paper, we propose an efficient approach called Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching, which approximates the full-information feedback. We formulate reward imputation as an imputation-regularized ridge regression problem that captures the feedback mechanisms of both executed and non-executed actions. To reduce time complexity, we solve the regression problem using randomized sketching. We prove that our approach achieves an instantaneous regret…
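The abstract's idea of solving a ridge regression problem with randomized sketching can be illustrated with a generic example. The sketch below is not the SPUIR algorithm: it omits the imputation regularization term and the bandit loop entirely, and it assumes a plain Gaussian sketch, which the abstract does not specify. All names (`sketched_ridge`, `exact_ridge`, `lam`, `sketch_size`) and the synthetic data are hypothetical choices made for illustration only.

```python
import numpy as np

def exact_ridge(X, y, lam):
    """Closed-form ridge regression on the full data (baseline)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def sketched_ridge(X, y, lam, sketch_size, rng):
    """Ridge regression on a randomly sketched copy of (X, y).

    Assumption: a Gaussian sketch; other sketching matrices are possible.
    """
    n, d = X.shape
    # Sketch matrix S (sketch_size x n), scaled so that E[S.T @ S] = I.
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    SX, Sy = S @ X, S @ y  # compress n rows down to sketch_size rows
    # Solve the same ridge problem on the much smaller sketched data.
    return np.linalg.solve(SX.T @ SX + lam * np.eye(d), SX.T @ Sy)

# Synthetic linear-reward data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

theta_exact = exact_ridge(X, y, lam=1.0)
theta_sketch = sketched_ridge(X, y, lam=1.0, sketch_size=400, rng=rng)
rel_err = np.linalg.norm(theta_sketch - theta_exact) / np.linalg.norm(theta_exact)
print(f"relative error of sketched solution: {rel_err:.4f}")
```

In this toy setting the regression is solved on 400 sketched rows instead of 2000 original rows, trading a small approximation error for a smaller problem. That is the general motivation for sketching; the paper's own guarantees concern its specific formulation.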
Taxonomy
Topics: Advanced Bandit Algorithms Research · Mobile Crowdsensing and Crowdsourcing · Recommender Systems and Techniques
