Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs
Guy Zamir, Matthew Zurek, Yudong Chen

TL;DR
This paper introduces a new algorithm for infinite-horizon MDPs that achieves optimal variance-dependent regret bounds, adapts to problem difficulty, and characterizes the impact of prior knowledge on regret minimization.
Contribution
It develops a single UCB-style algorithm with optimal variance-dependent regret guarantees for both average-reward and gamma-regret objectives, and analyzes the role of prior knowledge.
Findings
Achieves regret bounds of O((SA\,\text{Var})^{1/2}) in both settings.
Provides lower bounds showing the optimality of the dependence on the bias span and state-action space.
Demonstrates the algorithm's adaptability to deterministic MDPs with nearly constant regret.
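To make the variance-dependent bound above concrete, the following Python sketch computes a cumulative transition variance along a trajectory and the resulting (SA Var)^{1/2} leading term. It is only an illustration under stated assumptions: the toy MDPs, the value vector `h`, the trajectory, and the helper names are all hypothetical, the one-step variance of `h` under the transition distribution is an assumed stand-in for the paper's variance quantity, and constants, logarithmic factors, and any lower-order terms are ignored.

```python
import math

def transition_variance(p, h):
    """Variance of h(s') when the next state s' is drawn from distribution p."""
    mean = sum(pi * hi for pi, hi in zip(p, h))
    return sum(pi * (hi - mean) ** 2 for pi, hi in zip(p, h))

def cumulative_variance(P, h, trajectory):
    """Sum of one-step variances over a list of (state, action) pairs."""
    return sum(transition_variance(P[(s, a)], h) for s, a in trajectory)

def leading_term(S, A, var):
    """(S * A * Var)^(1/2), with constants and log factors dropped (assumption)."""
    return math.sqrt(S * A * var)

# Hypothetical 2-state, 1-action MDPs; h is a made-up value vector.
h = [0.0, 1.0]
P_det = {(0, 0): [0.0, 1.0], (1, 0): [1.0, 0.0]}   # deterministic transitions
P_rand = {(0, 0): [0.5, 0.5], (1, 0): [0.5, 0.5]}  # uniformly random transitions
trajectory = [(0, 0), (1, 0), (0, 0), (1, 0)]

var_det = cumulative_variance(P_det, h, trajectory)
var_rand = cumulative_variance(P_rand, h, trajectory)

print("deterministic:", var_det, leading_term(2, 1, var_det))
print("random:       ", var_rand, leading_term(2, 1, var_rand))
```

In this toy setup the deterministic MDP accumulates zero variance, so the leading term vanishes, which is consistent with the nearly constant regret claimed for deterministic MDPs; the random MDP accumulates variance at every step.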
Abstract
Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the γ-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form O((SA\,\text{Var})^{1/2}), where S and A are the state and action space sizes, and Var captures cumulative transition variance. This implies minimax-optimal average-reward and γ-regret bounds in the worst case but…
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization
