Stochastic Contextual Bandits with Long Horizon Rewards
Yuzhen Qin, Yingcong Li, Fabio Pasqualetti, Maryam Fazel, Samet Oymak

TL;DR
This paper develops new algorithms for stochastic contextual linear bandits whose rewards depend on a sparse set of past actions and contexts over a long horizon, achieving regret bounds that avoid polynomial dependence on the horizon length.
Contribution
It introduces algorithms leveraging sparsity to handle long-range dependencies in contextual bandits, with regret bounds applicable in both data-poor and data-rich regimes.
Findings
Regret bounds of $\tilde{O}(d\sqrt{sT} + \min\{q, T\})$ in the data-poor regime and $\tilde{O}(\sqrt{sdT})$ in the data-rich regime.
Learning over a single trajectory is inherently challenging due to long-range dependencies.
New analysis techniques connect circulant matrices' properties to sample complexity in dependent data settings.
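As a rough illustration of why circulant structure shows up with single-trajectory data, the sketch below builds a circulant matrix from one sequence: every column is a cyclic shift of the same data, so the columns are strongly dependent. It also checks a standard fact, namely that the singular values of a circulant matrix equal the magnitudes of the discrete Fourier transform of its first column. The function name `circulant_from_trajectory` and the toy sequence are assumptions made for illustration; the paper's own construction is not given in this summary.

```python
import numpy as np

def circulant_from_trajectory(z):
    """Build the circulant matrix whose first column is z (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    n = z.shape[0]
    # Column k is z cyclically shifted down by k positions.
    return np.stack([np.roll(z, k) for k in range(n)], axis=1)

# Toy trajectory, chosen arbitrarily for the demonstration.
z = np.array([1.0, 2.0, 0.0, -1.0])
C = circulant_from_trajectory(z)

# Singular values of C, sorted in descending order by NumPy.
singular_values = np.linalg.svd(C, compute_uv=False)

# Magnitudes of the DFT of the first column, sorted the same way.
fft_magnitudes = np.sort(np.abs(np.fft.fft(z)))[::-1]

spectra_match = np.allclose(singular_values, fft_magnitudes)
```

Because the spectrum is fixed entirely by the one underlying sequence, conditioning arguments that rely on independent rows do not apply directly, which is the kind of difficulty the finding above refers to.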
Abstract
The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T < h$) and data-rich ($T \ge h$) regimes, and derive respective regret upper bounds $\tilde{O}(d\sqrt{sT} + \min\{q, T\})$ and $\tilde{O}(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning…
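To make the setting concrete, here is a minimal simulation sketch of a reward that depends on a sparse set of past actions and contexts. It assumes one plausible form of the model: the reward at step t is a weighted sum of linear payoffs from at most `s` earlier steps within a window of length `h`, plus Gaussian noise. The exact reward model, the function name `simulate_sparse_history_rewards`, the uniformly random action policy, and all default values are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def simulate_sparse_history_rewards(T=200, h=50, s=3, d=4, n_arms=5,
                                    noise_std=0.1, seed=0):
    """Simulate rewards that depend on s past steps within a window of length h.

    Hypothetical model: r_t = sum over active lags of w[lag] * <x_{t-lag}, theta_{a_{t-lag}}> + noise.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_arms, d))         # assumed arm parameters
    lags = rng.choice(h, size=s, replace=False)  # sparse dependence pattern
    w = np.zeros(h)
    w[lags] = rng.normal(size=s)                 # weights on the active lags only

    contexts = rng.normal(size=(T, d))
    actions = rng.integers(n_arms, size=T)       # random policy, for illustration
    # Linear payoff of the chosen arm at each step.
    payoff = np.einsum("td,td->t", contexts, theta[actions])

    rewards = np.empty(T)
    for t in range(T):
        total = 0.0
        for lag in lags:
            if t - lag >= 0:                     # ignore lags reaching before step 0
                total += w[lag] * payoff[t - lag]
        rewards[t] = total + noise_std * rng.normal()
    return rewards, w, lags

rewards, w, lags = simulate_sparse_history_rewards()
```

In this sketch a learner that ignored sparsity would have to estimate all `h` entries of `w`, whereas only `s` of them are nonzero; that gap is what sparsity-aware algorithms aim to exploit.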
Taxonomy
Topics: Advanced Bandit Algorithms Research · Recommender Systems and Techniques · Bayesian Modeling and Causal Inference
