Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes

Ethan Blaser; Jiuqi Wang; Shangtong Zhang

arXiv:2602.16629·cs.LG·February 19, 2026

TL;DR

This paper proves the almost sure convergence of differential TD learning algorithms for average reward Markov decision processes under standard diminishing learning rates, removing the local clock that earlier convergence guarantees required.

Contribution

It establishes almost sure convergence of on-policy $n$-step differential TD under standard diminishing learning rates and gives three sufficient conditions for the off-policy case, bringing the theory closer to how these algorithms are run in practice.

Findings

01

Proves almost sure convergence of on-policy $n$-step differential TD, for any $n$, with standard diminishing learning rates and no local clock.

02

Derives three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock.

03

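Removes the dependence of earlier guarantees on a local clock tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings.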

Abstract

The average reward is a fundamental performance metric in reinforcement learning (RL) that focuses on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL, as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in the learning rates, tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results…
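To make the distinction between a local clock and a standard diminishing learning rate concrete, here is a minimal sketch of tabular one-step differential TD on a small Markov reward process. It is an illustration only, not the paper's $n$-step algorithm or its analysis: the function name `differential_td`, the reward convention `r[s]`, the parameter `eta`, and the step-size schedule $1/k^{0.75}$ are all assumptions chosen for the example. The only difference between the two modes is whether the step-size index `k` counts visits to the current state (local clock) or global time steps (standard schedule).

```python
import random


def differential_td(P, r, num_steps, eta=1.0, use_local_clock=False, seed=0):
    """Tabular one-step differential TD on a small Markov reward process.

    P[s][s2] is the probability of moving from state s to s2, and r[s] is the
    reward received on leaving s (both are illustrative assumptions).
    """
    rng = random.Random(seed)
    n = len(P)
    v = [0.0] * n        # differential value estimates
    avg_reward = 0.0     # running estimate of the average reward
    visits = [0] * n     # per-state visit counts, used only by the local clock
    s = 0
    for t in range(num_steps):
        visits[s] += 1
        # Local clock: step size indexed by visits to the current state.
        # Standard schedule: step size indexed by the global time step.
        k = visits[s] if use_local_clock else t + 1
        alpha = 1.0 / k ** 0.75  # assumed schedule, not taken from the paper
        s_next = rng.choices(range(n), weights=P[s])[0]
        # Differential TD error: reward minus the average-reward estimate,
        # plus the change in estimated differential value.
        delta = r[s] - avg_reward + v[s_next] - v[s]
        v[s] += alpha * delta
        avg_reward += eta * alpha * delta
        s = s_next
    return v, avg_reward


if __name__ == "__main__":
    # Two-state chain with uniform transitions; rewards 0 and 1.
    P = [[0.5, 0.5], [0.5, 0.5]]
    r = [0.0, 1.0]
    print("global clock:", differential_td(P, r, 20000))
    print("local clock: ", differential_td(P, r, 20000, use_local_clock=True))
```

In the tabular case both modes are easy to implement, but the local clock needs a visit counter per state, which has no direct analogue once values are represented by a function approximator. That is the practical gap the abstract refers to.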

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization