An Adiabatic Theorem for Policy Tracking with TD-learning
Neil Walton

TL;DR
This paper introduces an adiabatic theorem to analyze how well temporal difference learning tracks the reward function of a policy that changes over time, and derives finite-time bounds for tabular TD learning and Q-learning.
Contribution
It develops a novel adiabatic theorem for time-inhomogeneous Markov chains and derives finite-time bounds for TD and Q-learning with changing policies.
Findings
Finite-time bounds for TD and Q-learning under policy changes
A new adiabatic theorem bounding mixing times of inhomogeneous Markov chains
Bounds for stochastic approximation with asynchronous updates
Abstract
We evaluate the ability of temporal difference learning to track the reward function of a policy as it changes over time. Our results apply a new adiabatic theorem that bounds the mixing time of time-inhomogeneous Markov chains. We derive finite-time bounds for tabular temporal difference learning and Q-learning when the policy used for training changes in time. To achieve this, we develop bounds for stochastic approximation under asynchronous adiabatic updates.
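To make the tracking setting concrete, here is a minimal sketch of tabular TD(0) run along a single trajectory while the behaviour policy drifts slowly from one policy to another. It is an illustration only, not the paper's algorithm or its bounds: the function names (`policy_value`, `td_track`), the linear drift schedule, the constant step size, and all default parameters are assumptions made for this example.

```python
import numpy as np


def policy_value(P, R, pi, gamma):
    """Exact value function of a fixed policy, from the Bellman equation.

    P:  (S, A, S) transition probabilities
    R:  (S, A) expected rewards
    pi: (S, A) action probabilities per state
    """
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    r_pi = np.sum(pi * R, axis=1)          # expected one-step reward under pi
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)


def td_track(P, R, pi_start, pi_end, gamma=0.9, alpha=0.05, steps=20000, seed=0):
    """Tabular TD(0) on one trajectory while the policy drifts (hypothetical setup).

    The policy at step t is a convex combination of pi_start and pi_end, so it
    changes by O(1/steps) per step. Only the visited state is updated, which
    makes the updates asynchronous. Requires steps >= 2.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    s = 0
    for t in range(steps):
        w = t / (steps - 1)                       # drift parameter in [0, 1]
        pi = (1.0 - w) * pi_start + w * pi_end    # slowly changing policy
        a = rng.choice(n_actions, p=pi[s])
        s_next = rng.choice(n_states, p=P[s, a])
        # TD(0) update for the visited state only
        V[s] += alpha * (R[s, a] + gamma * V[s_next] - V[s])
        s = s_next
    # Return the estimate together with the exact value of the final policy
    return V, policy_value(P, R, pi_end, gamma)
```

Comparing the two returned vectors, for example with `np.max(np.abs(V - V_target))`, gives an empirical tracking error for a chosen drift speed (set by `steps`) and step size `alpha`. The finite-time bounds in the paper concern how an error of this kind depends on such quantities.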
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Advanced Control Systems Optimization
