Equivalence Between Policy Gradients and Soft Q-Learning
John Schulman, Xi Chen, Pieter Abbeel

TL;DR
This paper demonstrates a precise equivalence between entropy-regularized Q-learning and policy gradient methods, providing theoretical insights and empirical evidence of their similar performance on benchmarks.
Contribution
It establishes a formal equivalence between soft Q-learning and policy gradient methods in entropy-regularized reinforcement learning, clarifying their relationship.
Findings
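In the entropy-regularized setting, soft Q-learning is exactly equivalent to a policy gradient method.
Entropy-regularized versions of Q-learning and policy gradients perform well on the Atari benchmark.
A Q-learning method constructed to match A3C closely reproduces its learning dynamics without a target network or ε-greedy exploration schedule.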
Abstract
Two of the leading approaches for model-free reinforcement learning are policy gradient methods and Q-learning methods. Q-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why they work, since empirically, the Q-values they estimate are very inaccurate. A partial explanation may be that Q-learning methods are secretly implementing policy gradient updates: we show that there is a precise equivalence between Q-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, that "soft" (entropy-regularized) Q-learning is exactly equivalent to a policy gradient method. We also point out a connection between Q-learning methods and natural policy gradient methods. Experimentally, we explore the entropy-regularized versions of Q-learning and policy gradients, and we find them to perform…
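The link the abstract describes rests on the fact that, under entropy regularization, a Q-function determines a stochastic (Boltzmann) policy and a soft value, and the Q-values can be recovered from that pair. The sketch below illustrates only this correspondence for a single state; it is not the paper's algorithm and does not show the gradient equivalence itself. The function names, the temperature `tau`, the example Q-values, and the uniform reference policy `ref` are all assumptions chosen for illustration.

```python
import math

def soft_value(q, tau, ref):
    """Soft value V = tau * log sum_a ref[a] * exp(q[a] / tau).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(q)
    return m + tau * math.log(
        sum(p * math.exp((x - m) / tau) for x, p in zip(q, ref))
    )

def boltzmann_policy(q, tau, ref):
    """Boltzmann policy pi[a] = ref[a] * exp((q[a] - V) / tau)."""
    v = soft_value(q, tau, ref)
    return [p * math.exp((x - v) / tau) for x, p in zip(q, ref)]

# Hypothetical Q-values for one state with three actions.
q = [1.0, 2.0, 0.5]
tau = 0.5                # assumed temperature (entropy-regularization strength)
ref = [1.0 / 3.0] * 3    # assumed uniform reference policy

pi = boltzmann_policy(q, tau, ref)
v = soft_value(q, tau, ref)

# Recover Q from the (policy, value) pair: Q = V + tau * log(pi / ref).
q_rec = [v + tau * math.log(p / r) for p, r in zip(pi, ref)]
```

Here `pi` sums to one and `q_rec` reproduces `q` up to floating-point error, which is the sense in which a Q-function and a policy-plus-value pair carry the same information in this setting.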
Taxonomy
Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research
Methods: Entropy Regularization · Dense Connections · Softmax · Convolution · A3C
