Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes
Ethan Blaser, Jiuqi Wang, Shangtong Zhang

TL;DR
This paper proves the almost sure convergence of differential TD learning algorithms for average reward Markov Decision Processes using standard diminishing learning rates, removing the need for a local clock and extending theoretical guarantees.
Contribution
It establishes convergence of on-policy and off-policy differential TD algorithms with standard learning rates, enhancing their theoretical foundation and practical relevance.
Findings
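Proves almost sure convergence of on-policy n-step differential TD, for any n, with standard diminishing learning rates and no local clock.
Provides three sufficient conditions under which off-policy n-step differential TD also converges without a local clock.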
Strengthens theoretical understanding of differential TD algorithms.
Abstract
The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in learning rates tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy n-step differential TD for any n using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy n-step differential TD also converges without a local clock. These results…
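To make the "no local clock" setting concrete, here is a minimal sketch of tabular on-policy differential TD in the one-step case (n = 1), using a learning rate that depends only on the global step counter t rather than on per-state visit counts. The two-state chain, the reward function, the step-size exponent 0.75, and the names `differential_td`, `values`, `reward_rate`, and `eta` are all assumptions made for illustration; they are not taken from the paper.

```python
import random


def differential_td(num_steps=50_000, eta=1.0, seed=0):
    """One-step tabular differential TD on a hypothetical two-state chain.

    Assumed toy problem (not from the paper): the next state is drawn
    uniformly from {0, 1} and the reward for leaving state s equals s,
    so the true average reward is 0.5.
    """
    rng = random.Random(seed)
    values = [0.0, 0.0]   # differential value estimates, one per state
    reward_rate = 0.0     # running estimate of the average reward
    state = 0

    for t in range(num_steps):
        next_state = rng.randrange(2)
        reward = float(state)

        # Standard diminishing learning rate indexed by the global step t,
        # not by how often `state` has been visited (no local clock).
        # The exponent 0.75 is an illustrative choice.
        alpha = (t + 1) ** -0.75

        # Differential TD error: the reward is centered by the current
        # average-reward estimate instead of being discounted.
        td_error = reward - reward_rate + values[next_state] - values[state]

        values[state] += alpha * td_error
        reward_rate += eta * alpha * td_error

        state = next_state

    return values, reward_rate


if __name__ == "__main__":
    values, reward_rate = differential_td()
    print("estimated average reward:", reward_rate)
    print("estimated differential values:", values)
```

A local-clock variant would instead compute `alpha` from a per-state visit counter; the sketch deliberately avoids that, since the single global schedule is the case the abstract addresses.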
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Age of Information Optimization
