Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains
Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos

TL;DR
This paper establishes the first regret bounds for classical online Q-learning in infinite-horizon discounted MDPs without optimism, introducing new exploration schemes and concentration bounds for stochastic approximation.
Contribution
It provides the first regret analysis for classical online Q-learning without optimism, introduces a gap-robust exploration scheme, and develops a novel concentration bound for Markovian stochastic approximation.
Findings
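For Boltzmann Q-learning with decaying temperature, regret depends on the MDP's suboptimality gap: it is sublinear for sufficiently large gaps and can approach linear growth for small gaps.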
A gap-robust regret bound of near- is achieved using a Smoothed ε-Greedy scheme that combines ε-greedy and Boltzmann exploration.
High-probability sample complexity bounds are established for the proposed algorithms.
Abstract
We present the first regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes (MDPs), without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed ε-Greedy exploration scheme that combines ε-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-. We also obtain sample complexity guarantees, with both regret and sample complexity bounds holding with high probability. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian…
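The exploration idea described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the mixture form (ε times uniform plus 1 − ε times Boltzmann), the decay schedules for ε, the temperature, and the step size, and the two-state toy MDP are all assumptions chosen for illustration. Only the classical Q-learning update without a bonus term is taken directly from the description above.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values at a given temperature (max-shifted for stability)."""
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def smoothed_eps_greedy_probs(q_values, epsilon, temperature):
    """ASSUMED mixture: epsilon * uniform + (1 - epsilon) * Boltzmann.

    The paper's exact definition of Smoothed epsilon-Greedy may differ.
    """
    n = len(q_values)
    soft = boltzmann_probs(q_values, temperature)
    return [epsilon / n + (1.0 - epsilon) * p for p in soft]


def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """Classical Q-learning update with no optimism or bonus term."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def run(num_steps=2000, gamma=0.9, seed=0):
    """Run the sketch on a hypothetical 2-state, 2-action MDP."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]
    s = 0
    for n in range(1, num_steps + 1):
        # Illustrative schedules only; not the rates analyzed in the paper.
        epsilon = 1.0 / math.sqrt(n)
        temperature = 1.0 / math.log(n + 1)
        alpha = 1.0 / (n ** 0.6)
        probs = smoothed_eps_greedy_probs(Q[s], epsilon, temperature)
        a = rng.choices([0, 1], weights=probs)[0]
        r = 1.0 if a == 1 else 0.0      # toy reward: action 1 pays 1
        s_next = rng.randrange(2)       # toy dynamics: uniform next state
        q_learning_step(Q, s, a, r, s_next, alpha, gamma)
        s = s_next
    return Q
```

In this sketch the uniform component keeps every action's probability at least ε divided by the number of actions regardless of how close the Q-values are, while the Boltzmann component concentrates the remaining mass on high-value actions as the temperature decays.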
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques
