A Function Approximation Approach to Estimation of Policy Gradient for POMDP with Structured Policies
Huizhen Yu

TL;DR
This paper introduces a novel Actor-Critic approach for estimating policy gradients in POMDPs with structured policies, using a state-independent value function approximation, extending existing methods to partially observable and semi-Markov settings.
Contribution
It proposes a new Actor-Critic framework for POMDPs that employs a state-independent value function approximation, enabling efficient gradient estimation.
Findings
The critic computes a state-independent value function using TD methods.
The approach extends Actor-Critic algorithms to POMDPs and semi-Markov problems.
The method offers an alternative to the GPOMDP algorithm for POMDPs.
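To make the first finding concrete, here is a minimal sketch of a critic whose features depend only on observable quantities (the observation, the controller's internal state, and the action) and never on the hidden POMDP state. It uses a plain discounted TD(0) update with one-hot linear features, which is a simplification chosen for illustration and not necessarily the formulation used in the paper. The names `features`, `td0_update`, `alpha`, and `gamma` are hypothetical.

```python
from typing import List


def features(obs: int, ctrl_state: int, action: int,
             n_obs: int, n_ctrl: int, n_act: int) -> List[float]:
    """One-hot feature vector over (observation, controller state, action).

    The hidden POMDP state is deliberately not an input (illustrative choice).
    """
    phi = [0.0] * (n_obs * n_ctrl * n_act)
    phi[(obs * n_ctrl + ctrl_state) * n_act + action] = 1.0
    return phi


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def td0_update(w: List[float], phi: List[float], reward: float,
               phi_next: List[float], alpha: float = 0.1,
               gamma: float = 0.9) -> List[float]:
    """One discounted TD(0) step for a linear critic w . phi (assumed setup)."""
    td_error = reward + gamma * dot(w, phi_next) - dot(w, phi)
    return [wi + alpha * td_error * pi for wi, pi in zip(w, phi)]


# Two updates on a toy problem with 2 observations, 2 controller states, 2 actions.
w = [0.0] * 8
phi = features(0, 1, 0, 2, 2, 2)       # active index: (0*2 + 1)*2 + 0 = 2
phi_next = features(1, 0, 1, 2, 2, 2)  # active index: (1*2 + 0)*2 + 1 = 5
w = td0_update(w, phi, 1.0, phi_next)
w2 = td0_update(w, phi_next, 0.0, phi)
```

Because the features ignore the hidden state, the learned weights can only represent a function of the observable triple, which is the sense in which the critic's "value" function is state-independent.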
Abstract
We consider the estimation of the policy gradient in partially observable Markov decision processes (POMDP) with a special class of structured policies that are finite-state controllers. We show that the gradient estimation can be done in the Actor-Critic framework, by making the critic compute a "value" function that does not depend on the states of POMDP. This function is the conditional mean of the true value function that depends on the states. We show that the critic can be implemented using temporal difference (TD) methods with linear function approximations, and the analytical results on TD and Actor-Critic can be transferred to this case. Although Actor-Critic algorithms have been used extensively in Markov decision processes (MDP), up to now they have not been proposed for POMDP as an alternative to the earlier proposal GPOMDP algorithm, an actor-only method. Furthermore, we…
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Control Systems Optimization · Adversarial Robustness in Machine Learning
