Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach

Yanwei Jia; Xun Yu Zhou

arXiv:2108.06655 · cs.LG · February 2, 2022

TL;DR

This paper introduces a martingale-based framework for policy evaluation and TD methods in continuous time and space, yielding new algorithms whose discretized versions are shown to converge to the continuous-time solutions.

Contribution

It develops a unified martingale approach to policy evaluation, connecting classical TD algorithms to continuous-time martingale conditions and providing new algorithms with convergence guarantees.

Findings

01 Martingale characterization of policy evaluation in continuous settings

02 New algorithms based on martingale loss and orthogonality conditions

03 Convergence of discretized algorithms to continuous-time solutions
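The three objects named in these findings can be written out as a worked sketch. The notation is an assumption chosen for illustration, not taken from this page: a finite horizon T with no discounting, a state process X, a running reward r, a terminal reward h, a parametrized value function J^θ with J^θ(T, x) = h(x), and a test process ξ.

```latex
\begin{align*}
% Value function of the policy being evaluated
J(t,x) &= \mathbb{E}\Big[\int_t^T r(s,X_s)\,ds + h(X_T)\,\Big|\,X_t = x\Big],\\
% Martingale characterization: J is the value function iff M is a martingale
M_t &= J(t,X_t) + \int_0^t r(s,X_s)\,ds,\\
% Martingale loss for the approximation J^\theta, with M^\theta defined like M
\mathrm{ML}(\theta) &= \tfrac{1}{2}\,\mathbb{E}\int_0^T \big(M_T - M_t^{\theta}\big)^2\,dt,
\qquad M_T = h(X_T) + \int_0^T r(s,X_s)\,ds,\\
% Martingale orthogonality condition against a test process \xi
\mathbb{E}\int_0^T \xi_t\,dM_t^{\theta} &= 0 .
\end{align*}
```

Under these assumptions, the first finding corresponds to the statement about M, the second to minimizing ML(θ) or solving the orthogonality equation for θ, and the third to replacing the time integrals with sums over a grid and letting the grid size go to zero.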

Abstract

We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean-square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a "martingale loss function", whose solution is proved to be the best approximation of the true value function in the mean-square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the "martingale orthogonality conditions" with test functions. Solving these equations in different…
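The two methods described in the abstract can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the state is a standard Brownian motion started at 0, the running reward is zero, the terminal reward is h(x) = x^2, and the horizon is T, so the true value function is J(t, x) = x^2 + (T - t). The one-parameter family J_theta(t, x) = x^2 + theta * (T - t), the test function xi_t = T - t, and all function names are hypothetical choices made for this example. Both estimators should recover theta close to 1.

```python
import numpy as np


def simulate_paths(n_paths, n_steps, T, rng):
    """Simulate standard Brownian paths on a uniform grid (toy dynamics, an assumption)."""
    dt = T / n_steps
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    X = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)], axis=1)
    t = np.linspace(0.0, T, n_steps + 1)
    return t, X


def fit_martingale_loss(t, X, T):
    """Minimize the discretized martingale loss
        0.5 * mean over paths of sum_i (h(X_T) - J_theta(t_i, X_{t_i}))^2 * dt
    for J_theta(t, x) = x^2 + theta * (T - t). The loss is quadratic in theta,
    so the minimizer is an ordinary least-squares coefficient."""
    phi = T - t[:-1]                          # feature multiplying theta
    target = X[:, -1:] ** 2 - X[:, :-1] ** 2  # h(X_T) minus the theta-free part
    return float((target * phi).sum() / (X.shape[0] * (phi ** 2).sum()))


def fit_orthogonality(t, X, T):
    """Solve the discretized orthogonality condition
        mean over paths of sum_i xi_i * (M_{i+1} - M_i) = 0,
    where M_{i+1} - M_i = X_{i+1}^2 - X_i^2 - theta * dt for this family
    and xi_i = T - t_i is the (hypothetical) test function."""
    dt = np.diff(t)
    xi = T - t[:-1]
    dX2 = np.diff(X ** 2, axis=1)
    return float((dX2 * xi).sum() / (X.shape[0] * (xi * dt).sum()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t, X = simulate_paths(n_paths=20000, n_steps=50, T=1.0, rng=rng)
    print("martingale loss estimate:", fit_martingale_loss(t, X, 1.0))
    print("orthogonality estimate:  ", fit_orthogonality(t, X, 1.0))
```

The martingale-loss fit regresses on the whole remaining trajectory up to T, which is why the abstract relates it to gradient Monte-Carlo, while the orthogonality fit only uses one-step increments of the candidate martingale.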

Videos

Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach (SlidesLive)