On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes
Yi Wan, Huizhen Yu, Richard S. Sutton

TL;DR
This paper extends the convergence analysis of RVI Q-learning algorithms to weakly communicating MDPs, broadening theoretical understanding and practical applicability in average-reward reinforcement learning.
Contribution
It generalizes convergence proofs from unichain to weakly communicating MDPs and characterizes the convergence sets for RVI Q-learning algorithms.
Findings
RVI Q-learning converges almost surely in weakly communicating MDPs.
The convergence set is compact and connected, and it consists of solutions to the average-reward optimality equation.
Hierarchical RVI algorithms also converge almost surely under the weakly communicating assumption.
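For reference, the average-reward optimality equation mentioned in the findings can be written in action-value form as below. The notation is an assumption made here for illustration, not necessarily the paper's: r(s,a) is the expected reward, p(s'|s,a) the transition probability, and \bar{r} a scalar reward rate.

```latex
q(s,a) \;=\; r(s,a) \;-\; \bar{r} \;+\; \sum_{s'} p(s' \mid s,a)\, \max_{a'} q(s',a')
\qquad \text{for all } (s,a).
```

In a weakly communicating MDP the optimal reward rate is the same for every start state, and any solution has \bar{r} equal to that rate. The solution q is not unique: adding a constant to every entry yields another solution, and further degrees of freedom can exist.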
Abstract
This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion. We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for average-reward MDPs. These algorithms have low per-iteration complexity, making them well-suited for large state space problems. We extend the almost-sure convergence analysis of RVI Q-learning algorithms developed by Abounadi, Bertsekas, and Borkar (2001) from unichain to weakly communicating MDPs. This extension is important both practically and theoretically: weakly communicating MDPs cover a much broader range of applications compared to unichain MDPs, and their optimality equations have a richer solution structure (with multiple degrees of freedom), introducing additional complexity in proving algorithmic…
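To make the RVI Q-learning update concrete, here is a minimal tabular sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the two-state MDP, the uniform sampling of state-action pairs (a generative model rather than a single trajectory), the step-size schedule, and the choice f(Q) = Q[ref] are all picked here for simplicity. The toy MDP is deterministic and communicating (hence weakly communicating) but not unichain, because the policy that always stays has two recurrent classes.

```python
import random


def rvi_q_learning(next_state, reward, num_steps=100_000, seed=0, ref=(0, 0)):
    """Tabular RVI Q-learning sketch on a deterministic MDP.

    next_state[s][a] and reward[s][a] give the successor state and reward.
    f(Q) = Q[ref] serves as the running estimate of the reward rate.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(reward), len(reward[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    for _ in range(num_steps):
        # Assumption: state-action pairs are drawn uniformly at random.
        s = rng.randrange(n_states)
        a = rng.randrange(n_actions)
        s_next, r = next_state[s][a], reward[s][a]
        # Diminishing per-pair step size, chosen for illustration only.
        alpha = 5.0 / (250.0 + visits[s][a])
        visits[s][a] += 1
        # RVI target: reward minus f(Q) plus the greedy value of the next state.
        target = r - Q[ref[0]][ref[1]] + max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
    return Q, Q[ref[0]][ref[1]]


# Toy MDP: action 0 stays, action 1 switches states.
# Only staying in state 1 pays reward 1, so the optimal reward rate is 1.
NEXT_STATE = [[0, 1], [1, 0]]
REWARD = [[0.0, 0.0], [1.0, 0.0]]

Q, gain = rvi_q_learning(NEXT_STATE, REWARD)
print("estimated reward rate:", round(gain, 3))
```

Each update touches a single table entry, which is the low per-iteration cost the abstract refers to. Subtracting f(Q) keeps the iterates from drifting, and at a fixed point f(Q) equals the optimal reward rate.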
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Fault Detection and Control Systems · Neural Networks and Applications
Methods: Sparse Evolutionary Training · Q-Learning · Focus
