Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Guy Zamir; Matthew Zurek; Yudong Chen

arXiv:2603.23926·cs.LG·March 26, 2026

Open Access

TL;DR

This paper introduces a new algorithm for infinite-horizon MDPs that achieves optimal variance-dependent regret bounds, adapts to problem difficulty, and characterizes the impact of prior knowledge on regret minimization.

Contribution

The paper develops a single UCB-style algorithm with optimal variance-dependent regret guarantees for both the average-reward and $γ$-regret objectives, and analyzes the role of prior knowledge.

Findings

01

Achieves regret bounds of order $\tilde{O}(\sqrt{S A \, \mathrm{Var}})$, plus lower-order terms, in both settings.

02

Provides lower bounds showing the optimality of the dependence on the bias span and state-action space.

03

Shows that the algorithm adapts to easy instances, attaining nearly constant regret on deterministic MDPs.

Abstract

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $γ$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}(\sqrt{S A \, \mathrm{Var}} + \text{lower-order terms})$, where $S, A$ are the state and action space sizes, and $\mathrm{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $γ$-regret bounds in the worst case but…
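To make the "cumulative transition variance" quantity concrete, the following is a minimal sketch, not the paper's definition. It assumes the variance is taken over next-state values under the transition kernel and summed along a trajectory. The function names, the value vector, and the two toy MDPs are all hypothetical, and this summary does not specify which value function the paper uses.

```python
import numpy as np

def transition_variance(p_next, values):
    """Variance of values[s'] when the next state s' is drawn from p_next."""
    p_next = np.asarray(p_next, dtype=float)
    values = np.asarray(values, dtype=float)
    mean = p_next @ values
    return float(p_next @ (values - mean) ** 2)

def cumulative_variance(P, values, trajectory):
    """Sum of per-step transition variances over visited (state, action) pairs.

    P has shape (S, A, S): P[s, a] is the next-state distribution.
    """
    return sum(transition_variance(P[s, a], values) for s, a in trajectory)

# Hypothetical 2-state, 1-action MDPs (illustrative only).
values = np.array([0.0, 1.0])                      # assumed value of each state
P_det = np.array([[[0.0, 1.0]], [[1.0, 0.0]]])     # deterministic transitions
P_sto = np.array([[[0.5, 0.5]], [[0.5, 0.5]]])     # uniformly random transitions
trajectory = [(0, 0), (1, 0), (0, 0), (1, 0)]      # visited (state, action) pairs

var_det = cumulative_variance(P_det, values, trajectory)
var_sto = cumulative_variance(P_sto, values, trajectory)
print(var_det, var_sto)
```

Under these assumptions the deterministic kernel accumulates zero variance, so the leading $\sqrt{S A \, \mathrm{Var}}$ term vanishes and only lower-order terms remain, which is consistent with the nearly constant regret reported for deterministic MDPs. The random kernel accumulates variance that grows with the number of steps.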

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization