Is the Policy Gradient a Gradient?

Chris Nota; Philip S. Thomas

arXiv:1906.07073 · cs.LG · March 2, 2020 · 6 cites

Open Access

TL;DR

This paper shows that most policy gradient methods do not optimize the discounted return: because they drop the discount factor from the state distribution, the direction they follow is not the gradient of any function, and following it can converge to poor fixed points.

Contribution

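It proves that the update direction approximated by most policy gradient methods is not the gradient of any function, and it constructs a counterexample in which the resulting fixed point is globally pessimal with respect to both the discounted and undiscounted objectives, answering a question that had been open for several years.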

Findings

01. Most policy gradient updates are not true gradients.

02. A counterexample shows convergence to a globally pessimal fixed point.

03. Widespread misstatements in the literature are corrected.

Abstract

The policy gradient theorem describes the gradient of the expected discounted return with respect to an agent's policy parameters. However, most policy gradient methods drop the discount factor from the state distribution and therefore do not optimize the discounted objective. What do they optimize instead? This has been an open question for several years, and this lack of theoretical clarity has led to an abundance of misstatements in the literature. We answer this question by proving that the update direction approximated by most methods is not the gradient of any function. Further, we argue that algorithms that follow this direction are not guaranteed to converge to a "reasonable" fixed point by constructing a counterexample wherein the fixed point is globally pessimal with respect to both the discounted and undiscounted objectives. We motivate this work by surveying the literature…
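The distinction the abstract draws can be made concrete with a small sketch. The code below is not taken from the paper; it assumes a single-episode, REINFORCE-style Monte Carlo estimate, and the function name `update_directions` and its arguments are hypothetical. It computes two directions from the same episode: one that weights step t by gamma^t (the discounted state weighting the policy gradient theorem calls for) and one that omits that factor (the form the abstract says most methods use).

```python
import numpy as np

def update_directions(grad_log_probs, rewards, gamma):
    """Return (with_state_discount, without_state_discount) for one episode.

    grad_log_probs: shape (T, d); row t is grad_theta log pi(a_t | s_t).
    rewards: shape (T,); reward received after acting at step t.
    Hypothetical helper for illustration only.
    """
    grad_log_probs = np.asarray(grad_log_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)

    # Discounted return-to-go G_t = sum_{k >= t} gamma^(k - t) * r_k,
    # accumulated backwards through the episode.
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    # gamma^t weights each step by how far into the episode it occurs.
    discounts = gamma ** np.arange(T)

    with_state_discount = (discounts * returns) @ grad_log_probs
    without_state_discount = returns @ grad_log_probs
    return with_state_discount, without_state_discount


# Two-step episode with a reward only at the final step.
g_true, g_common = update_directions(
    grad_log_probs=[[1.0, 0.0], [0.0, 1.0]],
    rewards=[0.0, 1.0],
    gamma=0.5,
)
print(g_true)    # [0.5 0.5]
print(g_common)  # [0.5 1. ]
```

In this toy episode the two directions differ only in the component tied to the later time step, which the second form weights more heavily because the gamma^t factor is missing.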

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems