Expected Policy Gradients for Reinforcement Learning

Kamil Ciosek; Shimon Whiteson

arXiv:1801.03326·stat.ML·May 5, 2020·21 cites

Open Access

TL;DR

Expected Policy Gradients (EPG) unify stochastic and deterministic policy gradients, reducing variance and improving performance in reinforcement learning across continuous and discrete actions.

Contribution

The paper introduces EPG, a framework that generalizes stochastic and deterministic policy gradients. It gives analytical results for a broad class of policies (Gaussian, exponential-family, bounded-support, and softmax) and reports improved empirical results.

Findings

01. EPG reduces gradient estimate variance.

02. EPG outperforms existing methods in control tasks.

03. A new policy gradient theorem unifies stochastic and deterministic approaches.

Abstract

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussian policies and quadratic critics and then extend it to a universal analytical method, covering a broad class of actors and critics, including Gaussian, exponential families, and policies with bounded support. For Gaussian policies, we introduce an exploration method that uses covariance proportional to the matrix exponential of the scaled Hessian of the critic with respect to the actions. For discrete action spaces, we derive a variant of EPG based on softmax policies. We also establish a new…
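As a rough illustration of what "integrates (or sums) across actions" means in the discrete case, the sketch below compares a single-sample score-function estimate with the full sum over actions for a softmax policy in one state. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the logits, and the critic values `q_values` are all made up for illustration, and the critic is treated as a fixed table.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def spg_sample_gradient(logits, q_values, action):
    """Single-sample estimate: grad_theta log pi(a|s) * Q(s, a).

    For a softmax policy, d log pi(a) / d theta_k = 1[a == k] - pi(k).
    """
    pi = softmax(logits)
    return [((1.0 if k == action else 0.0) - pi[k]) * q_values[action]
            for k in range(len(logits))]


def epg_gradient(logits, q_values):
    """Sum over all actions of grad_theta pi(a|s) * Q(s, a) for one state.

    For a softmax policy, d pi(a) / d theta_k = pi(a) * (1[a == k] - pi(k)).
    """
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            dpi = pi[a] * ((1.0 if a == k else 0.0) - pi[k])
            grad[k] += dpi * q_values[a]
    return grad


# Hypothetical policy logits and critic values for a single state.
logits = [0.2, -0.5, 1.0]
q_values = [1.0, 3.0, -2.0]

exact = epg_gradient(logits, q_values)

# A handful of single-sample estimates; each differs from `exact`,
# while the sum over actions involves no action sampling at all.
rng = random.Random(0)
pi = softmax(logits)
samples = [
    spg_sample_gradient(
        logits, q_values, rng.choices(range(len(logits)), weights=pi)[0]
    )
    for _ in range(5)
]

print("summed over actions:", exact)
for s in samples:
    print("single-sample estimate:", s)
```

Both estimators have the same expectation over actions for a given state; the summed version simply removes the noise that comes from sampling the action, which is the variance reduction the abstract refers to.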

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Adaptive Dynamic Programming Control

Methods: Softmax