Experiments with Infinite-Horizon, Policy-Gradient Estimation

J. Baxter; P. L. Bartlett; L. Weaver

arXiv:1106.0666·cs.AI·November 18, 2019

TL;DR

This paper presents algorithms that perform gradient ascent on the average reward in POMDPs, using the biased gradient estimates produced by GPOMDP. GPOMDP has a single free parameter beta, requires no knowledge of the underlying state, and applies to infinite state, control, and observation spaces.

Contribution

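It shows how the infinite-horizon gradient estimates produced by GPOMDP can drive gradient ascent in POMDPs, both with a traditional stochastic-gradient algorithm and with a conjugate-gradient algorithm that uses gradient information to bracket maxima in line searches.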

Findings

01

Gradient estimation applies to POMDPs with infinite state, control, and observation spaces

02

Algorithms perform well on realistic problems

03

Single parameter beta balances bias and variance

Abstract

In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter and Bartlett, this volume), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter beta, which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental…
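
The estimator and the stochastic-gradient ascent loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the trace-and-running-average update is written from the commonly cited form of GPOMDP, and the toy problem (one observation, two actions, reward 1 for action 1), the helper names `toy_env_step`, `toy_sample_action`, `toy_grad_log_policy`, and the fixed step size in `gradient_ascent` are all assumptions introduced here for demonstration.

```python
import math
import random


def gpomdp_estimate(theta, beta, num_steps, env_step, sample_action,
                    grad_log_policy, obs, rng):
    """Return a GPOMDP-style (biased) estimate of the average-reward gradient.

    Only observations are used; the underlying state is never inspected.
    """
    z = [0.0] * len(theta)      # eligibility trace, discounted by beta
    delta = [0.0] * len(theta)  # running average of reward * trace
    for t in range(num_steps):
        action = sample_action(theta, obs, rng)
        grad = grad_log_policy(theta, obs, action)
        obs, reward = env_step(obs, action, rng)
        z = [beta * zi + gi for zi, gi in zip(z, grad)]
        delta = [di + (reward * zi - di) / (t + 1)
                 for di, zi in zip(delta, z)]
    return delta


# --- Hypothetical toy problem (assumption, not from the paper) ---
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_sample_action(theta, obs, rng):
    # Logistic policy: P(action = 1) = sigmoid(theta[0]).
    return 1 if rng.random() < sigmoid(theta[0]) else 0


def toy_grad_log_policy(theta, obs, action):
    # Gradient of log P(action | theta) for the logistic policy.
    p = sigmoid(theta[0])
    return [1.0 - p] if action == 1 else [-p]


def toy_env_step(obs, action, rng):
    # Single observation; reward 1 for action 1, otherwise 0.
    return obs, float(action)


def gradient_ascent(theta, beta, num_steps, step_size, iterations, rng):
    """Plain stochastic-gradient ascent driven by the estimator above."""
    theta = list(theta)
    for _ in range(iterations):
        grad = gpomdp_estimate(theta, beta, num_steps, toy_env_step,
                               toy_sample_action, toy_grad_log_policy, 0, rng)
        theta = [th + step_size * g for th, g in zip(theta, grad)]
    return theta
```

In this toy problem the average reward is `sigmoid(theta[0])`, so the true gradient is `p * (1 - p)` with `p = sigmoid(theta[0])`, and `beta = 0` is enough because each reward depends only on the most recent action. In problems where rewards depend on earlier actions, a larger beta would be expected to reduce bias at the cost of higher variance, which is the trade-off the abstract attributes to this parameter.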
