Infinite-Horizon Policy-Gradient Estimation
Jonathan Baxter, Peter L. Bartlett

TL;DR
This paper introduces GPOMDP, a simulation-based algorithm that estimates the gradient of the average reward in POMDPs. It needs storage for only twice the number of policy parameters, requires no knowledge of the underlying state, and comes with a convergence proof.
Contribution
The paper presents GPOMDP, a biased gradient-estimation algorithm for POMDPs controlled by parameterized stochastic policies. The algorithm is simple, memory-efficient, and provably convergent, and the paper describes extensions to more complex settings.
Findings
GPOMDP needs storage for only twice the number of policy parameters.
A single free parameter β controls the bias-variance trade-off of the gradient estimate.
The algorithm is proved to converge, and the appropriate choice of β is related to the mixing time of the controlled POMDP.
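The storage and β findings above can be illustrated with a minimal Python sketch. It assumes the usual form of the GPOMDP recursion: an eligibility trace `z` discounted by β in [0, 1) and a running average `delta` of reward times trace. The two-state environment (`toy_step`), the logistic policy parameterization, and all function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
import random


def policy_probs(theta, obs):
    """Two-action logistic policy; theta[obs] is the logit of action 1 (illustrative assumption)."""
    p1 = 1.0 / (1.0 + math.exp(-theta[obs]))
    return [1.0 - p1, p1]


def grad_log_policy(theta, obs, action):
    """Gradient of log mu(action | theta, obs) with respect to theta."""
    p1 = policy_probs(theta, obs)[1]
    grad = [0.0] * len(theta)
    grad[obs] = (1.0 - p1) if action == 1 else -p1
    return grad


def toy_step(action, rng):
    """Hypothetical environment: action 1 usually leads to state 1, which pays reward 1."""
    p_to_1 = 0.9 if action == 1 else 0.1
    next_state = 1 if rng.random() < p_to_1 else 0
    return next_state, float(next_state)


def gpomdp_estimate(theta, beta, num_steps, seed=0):
    """Estimate the average-reward gradient from one simulated trajectory."""
    rng = random.Random(seed)
    n = len(theta)
    z = [0.0] * n      # eligibility trace: n numbers
    delta = [0.0] * n  # running gradient estimate: n numbers (2n in total)
    state = 0
    for t in range(num_steps):
        obs = state  # assumption: the observation reveals the state in this toy case
        action = 1 if rng.random() < policy_probs(theta, obs)[1] else 0
        state, reward = toy_step(action, rng)
        g = grad_log_policy(theta, obs, action)
        for k in range(n):
            z[k] = beta * z[k] + g[k]                         # discounted trace
            delta[k] += (reward * z[k] - delta[k]) / (t + 1)  # running average
    return delta


if __name__ == "__main__":
    print(gpomdp_estimate([0.0, 0.0], beta=0.5, num_steps=20000))
```

Only `z` and `delta` persist between steps, which is where the "twice the number of policy parameters" figure comes from. The update reads only the observation, action, and reward, never the underlying state. In this sketch, raising `beta` toward 1 lengthens the trace's memory, which is the knob behind the bias-variance trade-off.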
Abstract
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP,…
