Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP

Zihan Zhang; Jiaqi Yang; Xiangyang Ji; Simon S. Du

arXiv:2101.12745 · cs.LG · November 1, 2021

TL;DR

This paper introduces variance-aware confidence sets for linear bandits and linear mixture MDPs, leading to regret bounds that adapt to unknown variances and improve existing results, especially in terms of dependence on horizon and episode count.

Contribution

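The paper develops new variance-aware confidence sets and supporting technical tools. These yield the first regret bound for linear bandits that scales with the variance and dimension with no explicit polynomial dependence on the number of rounds, and the first bound for linear mixture MDPs that scales only polylogarithmically with the horizon, resolving open problems.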

Findings

01

Regret bound for linear bandits scales with the variance and dimension, with no explicit polynomial dependence on K.

02

Regret bound for linear mixture MDPs scales polylogarithmically with horizon H.

03

New technical tools include a variance recursion estimator and a convex potential lemma.

Abstract

This paper presents new variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). With the new confidence sets, we obtain the following regret bounds: For linear bandits, we obtain an $\tilde{O}(\mathrm{poly}(d) \sqrt{1 + \sum_{k=1}^{K} \sigma_{k}^{2}})$ data-dependent regret bound, where $d$ is the feature dimension, $K$ is the number of rounds, and $\sigma_{k}^{2}$ is the unknown variance of the reward at the $k$-th round. This is the first regret bound that scales only with the variance and the dimension, with no explicit polynomial dependency on $K$. When variances are small, this bound can be significantly smaller than the $\tilde{\Theta}(d \sqrt{K})$ worst-case regret bound. For linear mixture MDPs, we obtain an $\tilde{O}(\mathrm{poly}(d, \log H) \sqrt{K})$ regret bound, where $d$ is the number of base models, $K$ is the number of…
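To give a feel for the gap between the two linear-bandit rates, the following minimal sketch compares a variance-dependent term of the form sqrt(1 + sum of variances) with a worst-case term of the form sqrt(K). It is only an illustration: the poly(d) and logarithmic factors are dropped (and they may differ between the two bounds), and the function names, the value of K, and the noise levels are assumptions chosen for the example, not values from the paper.

```python
import math


def variance_aware_rate(variances):
    """Return sqrt(1 + sum of per-round variances).

    Illustrative only: poly(d) and logarithmic factors are omitted.
    """
    return math.sqrt(1.0 + math.fsum(variances))


def worst_case_rate(num_rounds):
    """Return sqrt(K), with the dimension and logarithmic factors omitted."""
    return math.sqrt(num_rounds)


K = 10_000  # hypothetical number of rounds
scenarios = {
    "deterministic rewards (variance 0)": [0.0] * K,
    "low noise (variance 0.01)": [0.01] * K,
    "unit noise (variance 1)": [1.0] * K,
}

for name, variances in scenarios.items():
    aware = variance_aware_rate(variances)
    worst = worst_case_rate(K)
    print(f"{name}: variance-aware {aware:.2f} vs worst-case {worst:.2f}")
```

With zero variance the variance-dependent term stays constant no matter how large K is, while with unit variance the two terms are of the same order, which matches the abstract's remark that the gain appears when variances are small.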

Videos

Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP · SlidesLive

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms