Is the Policy Gradient a Gradient?
Chris Nota, Philip S. Thomas

TL;DR
This paper shows that most policy gradient methods do not optimize the discounted return: the update direction they follow is not the gradient of any function, so they are not guaranteed to converge to a reasonable fixed point.
Contribution
It proves that the update direction approximated by most policy gradient methods is not the gradient of any function, and it constructs a counterexample whose fixed point is globally pessimal, answering a question that had been open for several years.
Findings
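The update direction approximated by most policy gradient methods is not the gradient of any function.
A counterexample has a fixed point that is globally pessimal with respect to both the discounted and undiscounted objectives.
Widespread misunderstandings in the literature are corrected.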
Abstract
The policy gradient theorem describes the gradient of the expected discounted return with respect to an agent's policy parameters. However, most policy gradient methods drop the discount factor from the state distribution and therefore do not optimize the discounted objective. What do they optimize instead? This has been an open question for several years, and this lack of theoretical clarity has led to an abundance of misstatements in the literature. We answer this question by proving that the update direction approximated by most methods is not the gradient of any function. Further, we argue that algorithms that follow this direction are not guaranteed to converge to a "reasonable" fixed point by constructing a counterexample wherein the fixed point is globally pessimal with respect to both the discounted and undiscounted objectives. We motivate this work by surveying the literature…
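To make "dropping the discount factor from the state distribution" concrete, the following is a minimal sketch, not the authors' code. It assumes a single sampled trajectory, a REINFORCE-style estimator, and hypothetical inputs: `scores` holds the per-step score vectors (the gradient of the log policy probability of the chosen action) and `rewards` holds the per-step rewards. The only difference between the two estimates is the factor `gamma ** t` on each time step.

```python
from typing import List, Sequence, Tuple


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def policy_gradient_estimates(
    scores: Sequence[Sequence[float]],
    rewards: Sequence[float],
    gamma: float,
) -> Tuple[List[float], List[float]]:
    """Compute two single-trajectory update directions.

    with_discount:    sum_t gamma^t * G_t * score_t
                      (an estimate of the gradient of the discounted objective)
    without_discount: sum_t G_t * score_t
                      (the direction obtained when gamma^t is dropped)
    """
    returns = discounted_returns(rewards, gamma)
    dim = len(scores[0])
    with_discount = [0.0] * dim
    without_discount = [0.0] * dim
    for t, (score, g_t) in enumerate(zip(scores, returns)):
        for i in range(dim):
            with_discount[i] += (gamma ** t) * g_t * score[i]
            without_discount[i] += g_t * score[i]
    return with_discount, without_discount


# Toy two-step trajectory with made-up score vectors.
with_d, without_d = policy_gradient_estimates(
    scores=[[1.0, 0.0], [0.0, 1.0]],
    rewards=[1.0, 1.0],
    gamma=0.5,
)
print(with_d, without_d)
```

In this toy case the two directions agree on the first time step and differ on the second, where the missing `gamma ** t` factor doubles the contribution of the later step.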
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems
