Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Rahul Singh; Siddharth Chandak; Eric Moulines; Vivek S. Borkar; Nicholas Bambos

arXiv:2602.16274 · cs.LG · May 18, 2026

TL;DR

This paper establishes the first regret bounds for classical online Q-learning in infinite-horizon discounted MDPs without optimism, introducing new exploration schemes and concentration bounds for stochastic approximation.

Contribution

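It provides the first regret analysis for classical online Q-learning without optimism or bonus terms, introduces Smoothed $ϵ_{n}$-Greedy, an exploration scheme whose regret bound is robust to the suboptimality gap, and develops a high-probability concentration bound for contractive Markovian stochastic approximation.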

Findings

01

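For Boltzmann Q-learning with decaying temperature, regret depends on the MDP's suboptimality gap: it is sublinear for sufficiently large gaps and can approach linear growth for small gaps.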

02

A gap-robust regret bound of near-$\tilde{O}(N^{9/10})$ is achieved using Smoothed $ϵ_{n}$-Greedy, a scheme that combines $ϵ_{n}$-greedy and Boltzmann exploration.

03

High-probability sample complexity bounds are established for the proposed algorithms.

Abstract

We present the first regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes (MDPs), without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $ϵ_{n}$-Greedy exploration scheme that combines $ϵ_{n}$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. We also obtain sample complexity guarantees, with both regret and sample complexity bounds holding with high probability. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian…
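
To make the exploration idea in the abstract concrete, here is a minimal tabular sketch of online Q-learning in which each action is drawn from a mixture of a uniform distribution (the $ϵ_{n}$-greedy part) and a Boltzmann distribution over the current Q-values. Everything specific in it is an assumption made for illustration and is not taken from the paper: the mixture form itself, the schedules `eps = n**(-1/3)` and `temperature = 1/log(n+1)`, the `1/visits` step size, and the function names.

```python
import numpy as np


def boltzmann_probs(q_row, temperature):
    """Softmax over one state's Q-values (max subtracted for numerical stability)."""
    z = (q_row - q_row.max()) / temperature
    w = np.exp(z)
    return w / w.sum()


def smoothed_eps_greedy_probs(q_row, eps, temperature):
    """Hypothetical mixture: uniform with weight eps, Boltzmann with weight 1 - eps."""
    n_actions = q_row.shape[0]
    uniform = np.full(n_actions, 1.0 / n_actions)
    return eps * uniform + (1.0 - eps) * boltzmann_probs(q_row, temperature)


def online_q_learning(P, R, gamma=0.9, n_steps=5000, seed=0):
    """Single-trajectory tabular Q-learning, with no optimism or bonus terms.

    P has shape (S, A, S) with P[s, a] a distribution over next states.
    R has shape (S, A) and holds deterministic rewards.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = 0
    for n in range(1, n_steps + 1):
        eps = n ** (-1.0 / 3.0)              # assumed decaying exploration rate
        temperature = 1.0 / np.log(n + 1.0)  # assumed decaying temperature
        probs = smoothed_eps_greedy_probs(Q[s], eps, temperature)
        a = rng.choice(n_actions, p=probs)
        s_next = rng.choice(n_states, p=P[s, a])
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]           # assumed step size
        # Standard Q-learning update toward the sampled Bellman target.
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```

In this sketch the uniform component keeps every action's probability at or above `eps / n_actions` no matter how close the Q-values are, which gives some intuition for why a mixture could be less sensitive to the suboptimality gap than Boltzmann exploration alone.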

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques