Stochastic Contextual Bandits with Long Horizon Rewards

Yuzhen Qin; Yingcong Li; Fabio Pasqualetti; Maryam Fazel; Samet Oymak

arXiv:2302.00814 · cs.LG · February 7, 2023 · 1 citation

TL;DR

This paper develops new algorithms for stochastic contextual linear bandits in which the current reward depends on at most $s$ past actions and contexts within a horizon of $h$ steps, achieving regret bounds that avoid polynomial dependence on $h$.

Contribution

It introduces algorithms that exploit sparsity to handle long-range reward dependencies in contextual bandits, with regret bounds for both the data-poor ($T < h$) and data-rich ($T \geq h$) regimes.

Findings

01

Regret upper bounds of $\tilde{O}(d\sqrt{sT} + \min\{q, T\})$ and $\tilde{O}(\sqrt{sdT})$ for the data-poor ($T < h$) and data-rich ($T \geq h$) regimes, respectively.

02

Learning over a single trajectory is inherently challenging due to long-range dependencies.

03

New analysis techniques connect properties of circulant matrices to sample complexity in dependent-data settings.

Abstract

The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T < h$) and data-rich ($T \geq h$) regimes, and derive respective regret upper bounds $\tilde{O}(d\sqrt{sT} + \min\{q, T\})$ and $\tilde{O}(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning…
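
To make the setting concrete, the following is a minimal simulation sketch of a reward that depends on a sparse subset of the last $h$ action-context pairs. It is not the paper's algorithm. The weighted-sum reward form, the function names (`make_sparse_lag_weights`, `long_horizon_reward`, `simulate`), the Gaussian contexts, the inclusion of lag 0 (the current step), and the uniformly random arm choice are all assumptions made for illustration only.

```python
import random


def make_sparse_lag_weights(h, s, rng):
    """Return h lag weights with exactly s nonzero entries (assumed model)."""
    weights = [0.0] * h
    for lag in rng.sample(range(h), s):
        weights[lag] = rng.uniform(0.5, 1.0)
    return weights


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def long_horizon_reward(history, weights, noise_std, rng):
    """Noisy reward for the most recent step.

    history[-1 - lag] holds the (context, arm_parameter) pair used `lag`
    steps ago. Lags that reach before the start of the trajectory are
    skipped, which is what happens throughout the data-poor regime (T < h).
    """
    total = 0.0
    for lag, w in enumerate(weights):
        if w == 0.0 or lag >= len(history):
            continue
        context, theta = history[-1 - lag]
        total += w * dot(context, theta)
    return total + rng.gauss(0.0, noise_std)


def simulate(T, h, s, d, num_arms, seed=0):
    """Roll out T steps under a placeholder uniformly random policy."""
    rng = random.Random(seed)
    weights = make_sparse_lag_weights(h, s, rng)
    arms = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_arms)]
    history, rewards = [], []
    for _ in range(T):
        context = [rng.gauss(0.0, 1.0) for _ in range(d)]
        arm = rng.randrange(num_arms)  # placeholder, not a learning policy
        history.append((context, arms[arm]))
        rewards.append(long_horizon_reward(history, weights, 0.1, rng))
    return weights, rewards
```

Calling `simulate(T=50, h=200, s=3, d=4, num_arms=5)` would exercise the data-poor case, since the trajectory is shorter than the horizon. A learner in this setting would have to recover both the nonzero lags and the arm parameters from `rewards` alone, which is the joint discovery problem the abstract describes.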

Taxonomy

Topics: Advanced Bandit Algorithms Research · Recommender Systems and Techniques · Bayesian Modeling and Causal Inference