Equivalence Between Policy Gradients and Soft Q-Learning

John Schulman; Xi Chen; Pieter Abbeel

arXiv:1704.06440 · cs.LG · October 16, 2018 · 198 cites

TL;DR

This paper demonstrates a precise equivalence between entropy-regularized ("soft") Q-learning and policy gradient methods, and reports experiments in which the entropy-regularized versions of both methods perform well on Atari benchmarks.

Contribution

It establishes a formal equivalence between soft Q-learning and policy gradient methods in entropy-regularized reinforcement learning, clarifying their relationship.

Findings

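01 Soft (entropy-regularized) Q-learning is exactly equivalent to a policy gradient method.

02 The entropy-regularized versions of Q-learning and policy gradients perform as well as, or slightly better than, the standard variants on the Atari benchmark.

03 A Q-learning method constructed to mirror A3C closely matches its learning dynamics without using a target network or an ε-greedy exploration schedule.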

Abstract

Two of the leading approaches for model-free reinforcement learning are policy gradient methods and $Q$-learning methods. $Q$-learning methods can be effective and sample-efficient when they work; however, it is not well understood why they work, since empirically, the $Q$-values they estimate are very inaccurate. A partial explanation may be that $Q$-learning methods are secretly implementing policy gradient updates: we show that there is a precise equivalence between $Q$-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, namely that "soft" (entropy-regularized) $Q$-learning is exactly equivalent to a policy gradient method. We also point out a connection between $Q$-learning methods and natural policy gradient methods. Experimentally, we explore the entropy-regularized versions of $Q$-learning and policy gradients, and we find them to perform…
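The link between soft $Q$-values and a stochastic policy that the abstract alludes to can be illustrated for a single state. The sketch below is not taken from the paper: it assumes a plain entropy bonus with temperature `tau`, and the function names `soft_value` and `boltzmann_policy` are hypothetical. The paper's own formulation may differ in details such as the choice of regularizer. Under these assumptions, a set of $Q$-values determines a Boltzmann policy and a log-sum-exp state value, and the $Q$-values can be recovered from the value and the policy's log-probabilities.

```python
import math

def soft_value(q_values, tau):
    """Soft state value (assumed form): tau * log(sum_a exp(Q(s,a) / tau))."""
    m = max(q_values)  # subtract the max before exponentiating, for numerical stability
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def boltzmann_policy(q_values, tau):
    """Boltzmann policy (assumed form): pi(a|s) = exp((Q(s,a) - V(s)) / tau)."""
    v = soft_value(q_values, tau)
    return [math.exp((q - v) / tau) for q in q_values]

# Illustrative numbers only: Q-values for three actions in one state.
q = [1.0, 2.0, 0.5]
tau = 0.5

pi = boltzmann_policy(q, tau)
v = soft_value(q, tau)

# Identity 1: Q(s,a) = V(s) + tau * log pi(a|s), so (V, pi) and Q carry the same information.
recovered_q = [v + tau * math.log(p) for p in pi]

# Identity 2: V(s) = E_pi[Q(s,a)] + tau * H(pi), the expected return plus an entropy bonus.
entropy = -sum(p * math.log(p) for p in pi)
v_from_policy = sum(p * qa for p, qa in zip(pi, q)) + tau * entropy

print("policy:", pi)
print("soft value:", v, "value from policy and entropy:", v_from_policy)
```

Because the $Q$-function can be rewritten in terms of a value function and a policy in this way, an update to one can be read as an update to the other, which is the intuition behind comparing the two families of methods.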

Taxonomy

Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research

Methods: Entropy Regularization · Dense Connections · Softmax · Convolution · A3C