Reinforcement Learning in POMDP's via Direct Gradient Ascent
Jonathan Baxter, Peter L. Bartlett

TL;DR
This paper introduces GPOMDP, a gradient-based algorithm for optimizing policies in POMDPs that requires only a single sample path and no knowledge of the underlying state, with proven convergence.
Contribution
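The paper presents GPOMDP, a REINFORCE-like algorithm for POMDPs that estimates an approximation to the gradient of the average reward from a single sample path. It proves that the estimate converges and shows how the estimates can drive a conjugate-gradient search for locally optimal policies.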
Findings
GPOMDP requires only one sample path for gradient estimation.
The algorithm has a single free parameter with a clear bias-variance interpretation.
Convergence of GPOMDP is theoretically proven.
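For reference, the updates usually given for GPOMDP can be written as follows. The notation is an assumption, since this summary defines no symbols: y_t is the observation, u_t the action, \mu_u(\theta, y) the policy's probability of action u, i_t the hidden state, r the reward, z_t an eligibility trace, and \Delta_t the running gradient estimate, with z_0 = \Delta_0 = 0.

```latex
\begin{align*}
z_{t+1} &= \beta z_t + \frac{\nabla \mu_{u_t}(\theta, y_t)}{\mu_{u_t}(\theta, y_t)}, \\
\Delta_{t+1} &= \Delta_t + \frac{1}{t+1}\left[ r(i_{t+1})\, z_{t+1} - \Delta_t \right].
\end{align*}
```

The free parameter is the discount \beta on the trace. Moving \beta toward 1 reduces the bias of the estimate but increases its variance. Smaller \beta does the reverse.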
Abstract
This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward.
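The single-sample-path estimator described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The tabular softmax policy, the `step` environment hook, the fixed initial observation, and the one-observation toy problem are all assumptions made for the example.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def gpomdp_estimate(theta, step, n_steps, beta, rng):
    """Estimate the average-reward gradient from one sample path.

    theta[y][u] is the preference for action u under observation y
    (a tabular softmax policy, assumed here only for illustration).
    step(u) -> (next_observation, reward) is a hypothetical environment hook;
    the estimator never sees the underlying state.
    """
    n_obs, n_act = len(theta), len(theta[0])
    z = [[0.0] * n_act for _ in range(n_obs)]      # eligibility trace
    delta = [[0.0] * n_act for _ in range(n_obs)]  # running gradient estimate
    y = 0  # assumed initial observation
    for t in range(n_steps):
        probs = softmax(theta[y])
        u = rng.choices(range(n_act), weights=probs)[0]
        # Discount the old trace by beta ...
        for row in z:
            for a in range(n_act):
                row[a] *= beta
        # ... then add grad(mu_u) / mu_u, which for softmax is 1[a == u] - probs[a].
        for a in range(n_act):
            z[y][a] += (1.0 if a == u else 0.0) - probs[a]
        y, r = step(u)
        # Running average of reward * trace.
        for o in range(n_obs):
            for a in range(n_act):
                delta[o][a] += (r * z[o][a] - delta[o][a]) / (t + 1)
    return delta


# Toy problem (an assumption): one observation, action 1 pays 1, action 0 pays 0.
rng = random.Random(0)
theta = [[0.0, 0.0]]
grad = gpomdp_estimate(theta, lambda u: (0, float(u)), 20000, 0.0, rng)
print(grad)
```

In this toy problem the average reward equals the probability of action 1, so with equal preferences the exact gradient is -0.25 for action 0 and +0.25 for action 1, and the estimate should land close to those values. Here the reward depends only on the current action, so `beta = 0.0` already gives an unbiased estimate. Problems with delayed reward would need `beta` closer to 1, at the cost of higher variance.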
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
