Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration

Huizhen Yu; Yi Wan; Richard S. Sutton

arXiv:2512.06218 · cs.LG · December 9, 2025


Open Access

TL;DR

This paper proves almost sure convergence of RVI Q-learning, an asynchronous stochastic approximation analogue of Schweitzer's relative value iteration, for finite-space, weakly communicating average-reward semi-Markov decision processes (SMDPs).

Contribution

It establishes the convergence of an asynchronous RVI Q-learning algorithm for SMDPs and introduces new monotonicity conditions for estimating the optimal reward rate.

Findings

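01. Proves almost sure convergence of RVI Q-learning for finite-space, weakly communicating SMDPs.

02. Shows that the iterates converge to a compact, connected subset of solutions to the average-reward optimality equation, and to a unique, sample path-dependent solution under additional stepsize and asynchrony conditions.

03. Introduces new stability and monotonicity conditions.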
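For orientation, the average-reward optimality equation that these findings refer to is commonly written in action-value form as below. The notation is an assumption made here for illustration and is not taken from the paper: r(s,a) is the expected reward accumulated during a transition, \tau(s,a) the expected holding time, p the transition probabilities, and \rho^* the optimal reward rate.

```latex
q(s,a) \;=\; r(s,a) \;-\; \rho^{*}\,\tau(s,a) \;+\; \sum_{s'} p(s' \mid s,a)\,\max_{a'} q(s',a')
\qquad \text{for all state-action pairs } (s,a).
```

If q solves this equation, so does q + c for any constant c, which is one reason convergence is stated with respect to a set of solutions rather than a single point.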

Abstract

This paper applies the authors' recent results on asynchronous stochastic approximation (SA) in the Borkar-Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are…
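The kind of update the abstract describes can be sketched on a toy problem. This is a minimal illustration, not the authors' algorithm: the two-state model, the names `MODEL`, `REF`, and `rvi_q_learning`, the choice of f(Q) as the value of one fixed reference pair, the use of the observed holding time directly in the update, and the constant stepsize are all assumptions made here for brevity. The paper's conditions on stepsizes, asynchrony, and the reward-rate estimator are more general.

```python
import random

# Hypothetical toy SMDP with deterministic dynamics:
# MODEL[state][action] = (next_state, reward, holding_time).
MODEL = {
    0: {0: (1, 1.0, 1.0), 1: (0, 1.0, 2.0)},
    1: {0: (0, 3.0, 1.0), 1: (1, 2.0, 2.0)},
}
# Assumed reference pair: f(Q) = Q[REF] plays the role of the reward-rate estimate.
REF = (0, 0)


def rvi_q_learning(model, steps=20000, alpha=0.1, seed=0):
    """Asynchronous RVI-style Q-learning along a single random trajectory."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in model for a in model[s]}
    state = 0
    for _ in range(steps):
        # Uniformly random behavior policy, so every pair keeps being visited.
        action = rng.choice(list(model[state]))
        next_state, reward, tau = model[state][action]
        rate = q[REF]  # current estimate of the optimal reward rate
        best_next = max(q[(next_state, b)] for b in model[next_state])
        # The reward is offset by rate * holding time (the SMDP twist on RVI).
        td_error = reward - rate * tau + best_next - q[(state, action)]
        # Only the visited pair is updated: this is the asynchrony.
        q[(state, action)] += alpha * td_error
        state = next_state
    return q, q[REF]


q, rate = rvi_q_learning(MODEL)
print(f"estimated optimal reward rate: {rate:.3f}")
```

In this toy model the best stationary policy alternates between the two states, collecting 4 units of reward every 2 time units, so the estimate should settle near 2. A constant stepsize is tolerable here only because the model is noise-free. With random rewards, transitions, or holding times, diminishing stepsizes would be the natural choice.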


Taxonomy

Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Adaptive Dynamic Programming Control