Equivalence of stochastic and deterministic policy gradients
Emo Todorov

TL;DR
This paper shows that stochastic and deterministic policy gradients are equivalent in MDPs with Gaussian control noise and quadratic control costs, and proposes a general procedure for constructing an equivalent MDP with a deterministic policy whose controls are the sufficient statistics of the original stochastic policy.
Contribution
It establishes the conditions under which stochastic and deterministic policy gradients are equivalent and introduces a procedure to construct equivalent deterministic MDPs.
Findings
Stochastic and deterministic policy gradients, as well as the corresponding natural gradients, are identical in MDPs with Gaussian control noise and quadratic control costs.
State value functions are identical, but state-control value functions differ.
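A general procedure constructs an MDP with a deterministic policy that is equivalent to a given MDP with a stochastic policy; the controls of the new MDP are the sufficient statistics of the stochastic policy.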
Abstract
Policy gradients in continuous control have been derived for both stochastic and deterministic policies. Here we study the relationship between the two. In a widely-used family of MDPs involving Gaussian control noise and quadratic control costs, we show that the stochastic and deterministic policy gradients, natural gradients, and state value functions are identical; while the state-control value functions are different. We then develop a general procedure for constructing an MDP with deterministic policy that is equivalent to a given MDP with stochastic policy. The controls of this new MDP are the sufficient statistics of the stochastic policy in the original MDP. Our results suggest that policy gradient methods can be unified by approximating state value functions rather than state-control value functions.
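The sufficient-statistics idea in the abstract can be illustrated with a one-step scalar toy problem. This is only a sketch under stated assumptions, not the paper's construction: it assumes a single decision, a quadratic control cost 0.5 * r * u^2, and a Gaussian policy u ~ N(m, s^2). All names (`expect`, `control_cost`, `stochastic_gradient`, `deterministic_cost`, `deterministic_gradient`) are hypothetical. The score-function gradient with respect to the mean m is compared with the gradient of a deterministic cost whose "controls" are the statistics (m, s).

```python
import math

# Three-point Gauss-Hermite rule for a standard normal variable e.
# It is exact for polynomials in e up to degree 5, which covers this toy case.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def expect(f):
    """Return E[f(e)] for e ~ N(0, 1) using the three-point rule above."""
    return sum(w * f(e) for e, w in zip(NODES, WEIGHTS))


def control_cost(u, r):
    """Quadratic control cost (assumed form for this toy example)."""
    return 0.5 * r * u * u


def stochastic_gradient(m, s, r):
    """Score-function gradient of E[cost] w.r.t. the mean m, with u ~ N(m, s^2)."""
    # With u = m + s * e, the score d/dm log N(u; m, s^2) equals e / s.
    return expect(lambda e: control_cost(m + s * e, r) * e / s)


def deterministic_cost(m, s, r):
    """Cost of a hypothetical deterministic problem whose controls are (m, s)."""
    return 0.5 * r * (m * m + s * s)


def deterministic_gradient(m, s, r):
    """Gradient of deterministic_cost with respect to m."""
    return r * m


m, s, r = 0.7, 0.3, 2.0
expected_cost = expect(lambda e: control_cost(m + s * e, r))
grad_stochastic = stochastic_gradient(m, s, r)
grad_deterministic = deterministic_gradient(m, s, r)
```

In this toy case the expected cost under the Gaussian policy matches `deterministic_cost(m, s, r)`, and the two gradients agree. The multi-step MDP setting studied in the paper involves dynamics and value functions that this sketch leaves out.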
Taxonomy
Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research
