TL;DR
This paper introduces a martingale-based framework for policy evaluation (PE) and temporal difference (TD) methods in continuous time and space, and uses it to derive PE algorithms with convergence guarantees.
Contribution
It develops a unified martingale approach to policy evaluation, connecting classical TD algorithms to continuous-time martingale conditions and providing new algorithms with convergence guarantees.
Findings
Martingale characterization of policy evaluation in continuous settings
New algorithms based on martingale loss and orthogonality conditions
Convergence of discretized algorithms to continuous-time solutions
Abstract
We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean-square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a "martingale loss function", whose solution is proved to be the best approximation of the true value function in the mean-square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the "martingale orthogonality conditions" with test functions. Solving these equations in different…
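To make the first method in the abstract concrete, here is a minimal Python sketch of a discretized martingale loss. It is an illustration under assumptions that are not taken from the paper: toy dynamics X_t = x0 + sigma * W_t, terminal payoff h(x) = x^2, a constant running reward c, and a linear value approximation J_theta(t, x) = theta_0 * x^2 + theta_1 * (T - t). The function names simulate_paths and fit_martingale_loss are hypothetical. The sketch assumes the loss is the time-averaged squared gap between the realized reward-to-go and J_theta(t, X_t), which on a grid is a least-squares regression on Monte Carlo returns. Because the features are linear in theta, it solves the regression in closed form instead of running gradient steps.

```python
import numpy as np


def simulate_paths(n_paths, n_steps, T, sigma, x0, rng):
    """Simulate X_t = x0 + sigma * W_t on a uniform grid (assumed toy dynamics)."""
    dt = T / n_steps
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    X = np.empty((n_paths, n_steps + 1))
    X[:, 0] = x0
    X[:, 1:] = x0 + sigma * np.cumsum(dW, axis=1)
    return X


def fit_martingale_loss(X, T, c):
    """Minimize a discretized martingale loss for the linear model
    J_theta(t, x) = theta[0] * x**2 + theta[1] * (T - t).

    Assumed (hypothetical) setup: terminal payoff x**2, constant running reward c.
    """
    n_paths, n_steps = X.shape[0], X.shape[1] - 1
    dt = T / n_steps
    time_left = T - np.arange(n_steps) * dt  # T - t_k for k = 0..n_steps-1

    # Riemann-sum approximation of the running reward collected from t_k to T.
    running = np.full((n_paths, n_steps), c * dt)
    reward_to_go = np.cumsum(running[:, ::-1], axis=1)[:, ::-1]

    # Regression target: terminal payoff plus reward-to-go, i.e. M_T minus the
    # reward already collected by time t_k.
    target = X[:, -1:] ** 2 + reward_to_go

    # Features evaluated at (t_k, X_{t_k}) for every path and grid time.
    features = np.stack(
        [X[:, :-1] ** 2, np.broadcast_to(time_left, (n_paths, n_steps))],
        axis=-1,
    )

    # Closed-form least squares over all (path, time) pairs.
    theta, *_ = np.linalg.lstsq(
        features.reshape(-1, 2), target.reshape(-1), rcond=None
    )
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sigma, c, T = 1.0, 0.5, 1.0
    X = simulate_paths(n_paths=20000, n_steps=50, T=T, sigma=sigma, x0=0.0, rng=rng)
    theta = fit_martingale_loss(X, T=T, c=c)
    print("estimated theta:", theta)
    print("closed-form theta for this toy model:", (1.0, sigma**2 + c))
```

In this toy model the true value function is J(t, x) = x^2 + (sigma^2 + c)(T - t), so the fitted coefficients can be checked against (1, sigma^2 + c). The sketch covers only the martingale-loss method and does not attempt the martingale orthogonality conditions.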