Contextual Linear Bandits with Delay as Payoff
Mengxiao Zhang, Yingfei Wang, Haipeng Luo

TL;DR
This paper extends the delay-as-payoff model to contextual linear bandits, proposing an efficient phased elimination algorithm with regret bounds that handle delays proportional to payoffs, and demonstrates its effectiveness through experiments.
Contribution
It introduces a novel phased arm elimination algorithm for contextual linear bandits with delay-as-payoff, achieving near-optimal regret bounds and extending to varying action sets.
Findings
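The regret overhead relative to the standard no-delay case is at most D Δ_max log T, where T is the horizon, D is the maximum delay, and Δ_max is the maximum suboptimality gap.
When the payoff is a loss, the bound improves further, indicating a separation between the loss and reward settings.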
Experimental results demonstrate the algorithm's effectiveness and superior performance.
Abstract
A recent work by Schlisselberg et al. (2024) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff itself. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is at most D Δ_max log T, where T is the total horizon, D is the maximum delay, and Δ_max is the maximum suboptimality gap. When payoff is loss, we also show further improvement of the bound, demonstrating a separation between reward and loss similar to…
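The feedback model described in the abstract can be illustrated with a toy simulator. This is a minimal sketch, not the paper's algorithm or experimental setup: it assumes payoffs are clipped to [0, 1], that the delay is ceil(D × payoff) rounds, and that the mean payoff is linear in the chosen action. The class name `DelayAsPayoffEnv` and its parameters (`theta`, `max_delay`, `noise`) are hypothetical names introduced here for illustration.

```python
import heapq
import math
import random


class DelayAsPayoffEnv:
    """Toy linear bandit where feedback delay is proportional to the payoff.

    Assumptions (not from the paper): payoffs are clipped to [0, 1] and the
    delay is ceil(max_delay * payoff) rounds.
    """

    def __init__(self, theta, max_delay, noise=0.05, seed=0):
        self.theta = theta          # hypothetical unknown parameter vector
        self.max_delay = max_delay  # D: delay incurred by a payoff of 1
        self.noise = noise          # standard deviation of Gaussian noise
        self.rng = random.Random(seed)
        self.t = 0                  # number of pulls made so far
        self.pending = []           # min-heap keyed by arrival round

    def pull(self, action_index, action):
        """Play an action; its payoff is queued, not returned."""
        mean = sum(a * th for a, th in zip(action, self.theta))
        payoff = min(1.0, max(0.0, mean + self.rng.gauss(0.0, self.noise)))
        delay = math.ceil(self.max_delay * payoff)
        heapq.heappush(self.pending, (self.t + delay, self.t, action_index, payoff))
        self.t += 1

    def observe(self):
        """Return (pull_round, action_index, payoff) for feedback that has arrived."""
        arrived = []
        while self.pending and self.pending[0][0] < self.t:
            _, pull_round, action_index, payoff = heapq.heappop(self.pending)
            arrived.append((pull_round, action_index, payoff))
        return arrived
```

In this sketch, an action with mean payoff 0.25 and D = 8 has its feedback revealed after 2 further pulls, while an action with mean payoff 0.75 waits 6 further pulls. A learner therefore sees low payoffs sooner than high ones, which is the asymmetry the delay-as-payoff model is built around.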
Taxonomy
Topics: Advanced Bandit Algorithms Research · Decision-Making and Behavioral Economics · Auction Theory and Applications
Methods: Sparse Evolutionary Training
