Two Timescale Stochastic Approximation with Controlled Markov noise and Off-policy temporal difference learning

Prasenjit Karmakar; Shalabh Bhatnagar

arXiv:1503.09105·math.DS·February 28, 2017

TL;DR

This paper develops a convergence analysis for two time-scale stochastic approximation algorithms driven by controlled Markov noise and applies it to the off-policy convergence problem for temporal difference learning with linear function approximation.

Contribution

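It gives the first asymptotic convergence analysis of two time-scale stochastic approximation in which both the faster and slower recursions carry non-additive controlled Markov noise in addition to martingale difference noise, and it uses this analysis to address the convergence of off-policy TD learning with linear function approximation.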

Findings

01

Established convergence of two time-scale stochastic approximation with controlled Markov noise.

02

Provided a solution to the off-policy convergence problem for temporal difference learning with linear function approximation.

03

Linked the asymptotic behavior to limiting differential inclusions on both time-scales, defined in terms of the ergodic occupation measures of the controlled Markov processes.
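
In symbols, the setting summarized in these findings can be sketched as follows. The notation here (h, g, Z^{(i)}, M^{(i)}, a(n), b(n), D^{(2)}) is assumed for illustration and may differ from the paper's; the control component of the occupation measure is suppressed for brevity.

```latex
% Coupled recursions: \theta is the slower iterate, w the faster one.
% Z^{(i)}_n are controlled Markov noise processes entering non-additively,
% M^{(i)}_{n+1} are martingale difference noise terms.
\begin{align*}
\theta_{n+1} &= \theta_n + a(n)\left[ h\big(\theta_n, w_n, Z^{(1)}_n\big) + M^{(1)}_{n+1} \right], \\
w_{n+1} &= w_n + b(n)\left[ g\big(\theta_n, w_n, Z^{(2)}_n\big) + M^{(2)}_{n+1} \right],
\qquad \frac{a(n)}{b(n)} \to 0 .
\end{align*}

% Faster time-scale limit with \theta treated as frozen:
% the drift is averaged over ergodic occupation measures \nu.
\[
\dot{w}(t) \in \left\{ \int g\big(\theta, w(t), z\big)\, \nu(dz) \;:\; \nu \in D^{(2)}\big(\theta, w(t)\big) \right\}
\]
```

Here D^{(2)}(θ, w) stands for the set of ergodic occupation measures of the noise process Z^{(2)} when the iterates are held at (θ, w). When that set contains a single measure, the inclusion reduces to an ordinary differential equation.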

Abstract

We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by 'controlled' Markov noise. In particular, both the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time-scales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Finally, we present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation, using our results.
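
To make the two time-scale idea concrete, here is a minimal toy sketch in Python. It is not the algorithm analyzed in the paper: the two-state chain, the cost values, the drift functions, and the step sizes are all hypothetical choices made for illustration, and the chain is uncontrolled (its transition probabilities do not depend on the iterates), which is a simplification of the controlled setting the abstract describes. The faster iterate w uses the larger step size b_n and tracks theta plus the stationary mean of the noise; the slower iterate theta uses a_n with a_n / b_n tending to 0.

```python
import random


def simulate(num_steps=200_000, seed=0):
    """Toy two time-scale stochastic approximation with Markov noise.

    Everything here is a hypothetical illustration, not the paper's setup.
    """
    rng = random.Random(seed)

    # Two-state Markov chain: stay[y] is the probability of remaining in y.
    # Its stationary distribution is (2/3, 1/3).
    stay = [0.9, 0.8]
    # Noise value observed in each state; the stationary mean is 2.0.
    cost = [1.0, 4.0]

    y = 0
    theta, w = 5.0, -5.0  # arbitrary starting points

    for n in range(1, num_steps + 1):
        a_n = 1.0 / n         # slower step size
        b_n = 1.0 / n ** 0.6  # faster step size, so a_n / b_n -> 0

        noise = cost[y]

        # Faster recursion: for frozen theta, w is driven toward theta + 2.
        w_next = w + b_n * (noise + theta - w)
        # Slower recursion: with w near theta + 2, the averaged drift
        # is -2 * theta, which pushes theta toward 0.
        theta = theta + a_n * (noise - w - theta)
        w = w_next

        # Advance the Markov chain by one step.
        if rng.random() > stay[y]:
            y = 1 - y

    return theta, w


if __name__ == "__main__":
    theta, w = simulate()
    print(f"theta = {theta:.3f}, w = {w:.3f}")
```

Under these assumed dynamics the averaged system has the single equilibrium theta = 0, w = 2, so the printed values should land close to that point. The noise enters both recursions through the same Markov chain rather than as independent samples, which is the feature that separates this setting from the classical martingale-noise case.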
