Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP
Zihan Zhang, Jiaqi Yang, Xiangyang Ji, Simon S. Du

TL;DR
This paper introduces variance-aware confidence sets for linear bandits and linear mixture MDPs, leading to regret bounds that adapt to unknown variances and improve existing results, especially in terms of dependence on horizon and episode count.
Contribution
The paper develops novel variance-aware confidence sets and technical tools, achieving the first regret bounds that scale only with variance and logarithmically with horizon, resolving open problems.
Findings
The regret bound for linear bandits scales with the variance and the dimension, with no explicit polynomial dependence on the number of rounds K.
The regret bound for linear mixture MDPs depends on the planning horizon H only polylogarithmically.
New technical tools include a variance recursion estimator and a convex potential lemma.
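To illustrate the first finding, the following minimal sketch compares the round-dependent factor of the variance-aware linear bandit bound, sqrt(1 + sum of sigma_k^2), with the sqrt(K) factor of the worst-case bound. It deliberately ignores the poly(d) and logarithmic factors, so the numbers show only how the two factors grow, not actual regret. The function names and the three noise scenarios are hypothetical and chosen for illustration.

```python
import math


def variance_aware_factor(variances):
    """Round-dependent factor sqrt(1 + sum_k sigma_k^2) of the variance-aware bound.

    poly(d) and logarithmic factors are omitted (assumption made for illustration).
    """
    return math.sqrt(1.0 + sum(variances))


def worst_case_factor(num_rounds):
    """Round-dependent factor sqrt(K) of the worst-case bound (d and logs omitted)."""
    return math.sqrt(num_rounds)


if __name__ == "__main__":
    K = 10_000  # hypothetical number of rounds
    scenarios = {
        "deterministic rewards (sigma_k^2 = 0)": [0.0] * K,
        "small noise (sigma_k^2 = 0.01)": [0.01] * K,
        "unit noise (sigma_k^2 = 1)": [1.0] * K,
    }
    for name, variances in scenarios.items():
        aware = variance_aware_factor(variances)
        worst = worst_case_factor(K)
        print(f"{name}: variance-aware {aware:.2f} vs worst-case {worst:.2f}")
```

With deterministic rewards the variance-aware factor stays at 1 regardless of K, whereas with unit noise the two factors are essentially the same, which matches the claim that the gain appears when variances are small.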
Abstract
This paper presents new variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). With the new confidence sets, we obtain the following regret bounds: For linear bandits, we obtain an $\tilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{k=1}^{K}\sigma_k^2})$ data-dependent regret bound, where $d$ is the feature dimension, $K$ is the number of rounds, and $\sigma_k^2$ is the unknown variance of the reward at the $k$-th round. This is the first regret bound that only scales with the variance and the dimension but has no explicit polynomial dependency on $K$. When variances are small, this bound can be significantly smaller than the $\tilde{\Theta}(d\sqrt{K})$ worst-case regret bound. For linear mixture MDPs, we obtain an $\tilde{O}(\mathrm{poly}(d, \log H)\sqrt{K})$ regret bound, where $d$ is the number of base models, $K$ is the number of…
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
