Experiments with Infinite-Horizon, Policy-Gradient Estimation
J. Baxter, P. L. Bartlett, L. Weaver

TL;DR
This paper presents algorithms that perform gradient ascent of the average reward in POMDPs. They are built on GPOMDP, a biased gradient estimator that has a single free parameter beta, requires no knowledge of the underlying state, and applies to infinite state, control, and observation spaces.
Contribution
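It shows how the gradient estimates produced by GPOMDP (introduced in a companion paper by Baxter and Bartlett) can be used for gradient ascent, both with a traditional stochastic-gradient algorithm and with a conjugate-gradient algorithm that uses gradient information to bracket maxima in line searches.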
Findings
Gradient estimates apply to POMDPs with infinite state, control, and observation spaces
Algorithms perform well on realistic problems
Single parameter beta balances bias and variance
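To make the role of beta concrete, here is a minimal sketch of a GPOMDP-style estimator. The recursion follows the form in which GPOMDP is usually stated (an eligibility trace discounted by beta, plus a running average of reward times trace); it is not reproduced from this summary. The toy environment `TwoStateEnv`, the softmax table policy, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - x.max())
    return e / e.sum()


class TwoStateEnv:
    """Hypothetical toy problem: action u moves the state to u with
    probability 0.9; reward is 1 in state 1. The observation equals the
    state here, but the estimator never reads the state directly."""

    def __init__(self):
        self.state = 0

    def step(self, u, rng):
        self.state = u if rng.random() < 0.9 else 1 - u
        reward = float(self.state == 1)
        return reward, self.state  # (reward, next observation)


def gpomdp_estimate(theta, env, y0, beta, T, rng):
    """Sketch of a GPOMDP-style gradient estimate.

    theta: (n_obs, n_actions) table of softmax logits (assumed policy form).
    beta:  discount in [0, 1); larger values lower bias but raise variance.
    """
    z = np.zeros_like(theta)      # eligibility trace
    delta = np.zeros_like(theta)  # running average of reward * trace
    y = y0
    for t in range(T):
        p = softmax(theta[y])
        u = rng.choice(len(p), p=p)
        # Gradient of log softmax w.r.t. the logits of row y: onehot(u) - p.
        g = np.zeros_like(theta)
        g[y] = -p
        g[y, u] += 1.0
        r, y = env.step(u, rng)
        z = beta * z + g
        delta += (r * z - delta) / (t + 1)
    return delta


rng = np.random.default_rng(0)
grad = gpomdp_estimate(np.zeros((2, 2)), TwoStateEnv(), y0=0,
                       beta=0.5, T=20000, rng=rng)
print(grad)  # the column for action 1 should come out positive
```

In this toy problem the reward depends only on the most recent action, so even beta = 0 gives an unbiased estimate; problems with longer-range dependence between actions and rewards are where a larger beta is needed, at the cost of higher variance.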
Abstract
In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter and Bartlett, this volume), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter beta, which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental…
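The idea of using gradient information to bracket a maximum in a line search can be sketched as follows. This is a generic illustration, not the paper's algorithm: it steps along a search direction until the directional derivative changes sign, so the bracket is found from gradient evaluations alone. The function `bracket_maximum`, its parameters, and the quadratic test objective are hypothetical.

```python
import numpy as np


def bracket_maximum(grad_fn, theta, direction, step=1.0, grow=2.0, max_iter=50):
    """Return (s_lo, s_hi) such that the directional derivative along
    `direction` is non-negative at s_lo and negative at s_hi, so a local
    maximum of the objective along the line lies in between.

    grad_fn: callable returning a gradient (or gradient estimate) at a point.
    """
    if np.dot(grad_fn(theta), direction) < 0:
        raise ValueError("direction is not an ascent direction at theta")
    s_lo, s = 0.0, step
    for _ in range(max_iter):
        if np.dot(grad_fn(theta + s * direction), direction) < 0:
            return s_lo, s
        s_lo, s = s, s * grow  # still ascending: move the bracket outward
    raise RuntimeError("no sign change found within max_iter steps")


# Usage on a concave quadratic with its maximum at c = (3, 0).
c = np.array([3.0, 0.0])
grad_fn = lambda x: -2.0 * (x - c)
lo, hi = bracket_maximum(grad_fn, np.zeros(2), np.array([1.0, 0.0]))
print(lo, hi)
```

With a noisy estimator in place of the exact `grad_fn`, the sign test would be applied to estimated gradients, so a practical version would need some tolerance for sign errors near the maximum.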
