Policy Gradient Algorithms Implicitly Optimize by Continuation
Adrien Bolland, Gilles Louppe, Damien Ernst

TL;DR
This paper offers a new theoretical perspective on policy-gradient algorithms in reinforcement learning: they implicitly perform optimization by continuation, so that exploration amounts to computing a continuation of the policy's return and the policy variance controls that continuation.
Contribution
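It formulates direct policy optimization in the optimization-by-continuation framework and shows that optimizing affine Gaussian policies with entropy regularization can be interpreted as implicitly optimizing deterministic policies by continuation.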
Findings
Policy gradients can be viewed as continuation methods.
Entropy regularization implicitly optimizes deterministic policies.
Policy variance should adapt based on history to improve exploration.
Abstract
Direct policy optimization in reinforcement learning is usually solved with policy-gradient algorithms, which optimize policy parameters via stochastic gradient ascent. This paper provides a new theoretical interpretation and justification of these algorithms. First, we formulate direct policy optimization in the optimization by continuation framework. The latter is a framework for optimizing nonconvex functions where a sequence of surrogate objective functions, called continuations, is locally optimized. Second, we show that optimizing affine Gaussian policies and performing entropy regularization can be interpreted as implicitly optimizing deterministic policies by continuation. Based on these theoretical results, we argue that exploration in policy-gradient algorithms consists in computing a continuation of the return of the policy at hand, and that the variance of policies should…
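To make the idea of optimization by continuation concrete, here is a minimal sketch on a toy one-dimensional problem. It is not the paper's algorithm: the objective f(x) = -A*x**2 + cos(B*x), the constants A and B, the sigma schedule, the step size, and all function names are assumptions chosen for illustration. The continuation used here is Gaussian smoothing, f_sigma(x) = E[f(x + sigma*eps)] with eps ~ N(0, 1), which has a closed form for this particular f. The smoothing width sigma plays a role loosely analogous to the standard deviation of a Gaussian policy.

```python
import math

# Hypothetical toy-objective constants (not from the paper).
A, B = 0.05, 3.0


def f(x):
    """Nonconvex toy objective: global maximum at x = 0, many local maxima."""
    return -A * x**2 + math.cos(B * x)


def smoothed_grad(x, sigma):
    """Gradient of the Gaussian continuation f_sigma(x) = E[f(x + sigma*eps)].

    For this f, f_sigma(x) = -A*(x**2 + sigma**2)
                             + exp(-(B*sigma)**2 / 2) * cos(B*x),
    so the cosine ripples are damped as sigma grows. sigma = 0 recovers f.
    """
    damping = math.exp(-0.5 * (B * sigma) ** 2)
    return -2.0 * A * x - B * damping * math.sin(B * x)


def ascend(x, sigma, steps=300, lr=0.1):
    """Plain gradient ascent on the surrogate f_sigma (local optimization)."""
    for _ in range(steps):
        x += lr * smoothed_grad(x, sigma)
    return x


def continuation_ascent(x0, sigmas=(2.0, 1.0, 0.5, 0.25, 0.0)):
    """Locally optimize a sequence of surrogates with shrinking sigma,
    warm-starting each stage from the previous solution."""
    x = x0
    for sigma in sigmas:
        x = ascend(x, sigma)
    return x


x_plain = ascend(4.0, sigma=0.0)   # direct ascent on f itself
x_cont = continuation_ascent(4.0)  # ascent through the continuations

print(f"plain ascent:        x = {x_plain:.3f}, f(x) = {f(x_plain):.3f}")
print(f"continuation ascent: x = {x_cont:.3f}, f(x) = {f(x_cont):.3f}")
```

Started from x = 4.0, direct ascent on f stays at the nearby local maximum (roughly x = 4.1), while the continuation schedule first follows the heavily smoothed, nearly quadratic surrogate toward the origin and then sharpens it, ending near the global maximum at x = 0. In this toy setting, shrinking sigma too early would leave the iterate trapped in a local maximum, which is one way to read the abstract's point that policy variance governs the continuation being optimized.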
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference
Methods: Entropy Regularization
