An Adiabatic Theorem for Policy Tracking with TD-learning

Neil Walton

arXiv:2010.12848 · cs.LG · November 3, 2020

Open Access

TL;DR

This paper uses a new adiabatic theorem to analyze how well temporal difference learning tracks the reward function of a policy that changes over time, and derives finite-time bounds for tabular TD learning and Q-learning under such changing policies.

Contribution

It develops a novel adiabatic theorem for time-inhomogeneous Markov chains and derives finite-time bounds for TD and Q-learning with changing policies.

Findings

01 Finite-time bounds for TD and Q-learning under policy changes

02 A new adiabatic theorem bounding mixing times of inhomogeneous Markov chains

03 Bounds for stochastic approximation with asynchronous updates

Abstract

We evaluate the ability of temporal difference learning to track the reward function of a policy as it changes over time. Our results apply a new adiabatic theorem that bounds the mixing time of time-inhomogeneous Markov chains. We derive finite-time bounds for tabular temporal difference learning and $Q$-learning when the policy used for training changes in time. To achieve this, we develop bounds for stochastic approximation under asynchronous adiabatic updates.
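The tracking setting described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the paper's construction: it assumes a three-state chain whose transition matrix drifts linearly from a hypothetical `P0` to a hypothetical `P1` (standing in for a slowly changing policy), a fixed reward vector `r`, and a constant step size `alpha`. Tabular TD(0) updates only the state currently visited, which is the asynchronous aspect, and the sketch records the gap between the TD estimate and the exact value function of the current transition matrix.

```python
import numpy as np

def true_values(P, r, gamma):
    """Exact discounted value function for a fixed transition matrix P."""
    return np.linalg.solve(np.eye(len(r)) - gamma * P, r)

def track_with_td0(P0, P1, r, gamma=0.9, alpha=0.05, steps=20000, seed=0):
    """Run tabular TD(0) while the chain drifts slowly from P0 to P1.

    The linear drift and the constant step size are illustrative
    assumptions, not choices taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(r)
    V = np.zeros(n)          # TD estimate of the value function
    s = 0                    # current state
    errors = np.empty(steps)
    for t in range(steps):
        mix = t / steps      # slow (adiabatic-style) drift parameter
        P = (1.0 - mix) * P0 + mix * P1
        s_next = rng.choice(n, p=P[s])
        # Asynchronous update: only the visited state s is changed.
        V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])
        # Sup-norm distance to the value function of the current chain.
        errors[t] = np.max(np.abs(V - true_values(P, r, gamma)))
        s = s_next
    return V, errors

# Hypothetical example data: two row-stochastic matrices and a reward vector.
P0 = np.array([[0.8, 0.1, 0.1],
               [0.1, 0.8, 0.1],
               [0.1, 0.1, 0.8]])
P1 = np.array([[0.2, 0.4, 0.4],
               [0.4, 0.2, 0.4],
               [0.4, 0.4, 0.2]])
r = np.array([0.0, 1.0, 2.0])

V, errors = track_with_td0(P0, P1, r)
print("initial tracking error:", errors[0])
print("mean tracking error over last 1000 steps:", errors[-1000:].mean())
```

Varying `steps` (the drift speed) against `alpha` in this sketch gives a rough feel for the trade-off that finite-time tracking bounds quantify: slower drift leaves the estimate more time to follow the moving target.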

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Advanced Control Systems Optimization