Soft Q Network

Jingbin Liu; Shuai Liu; Xinyang Gu

arXiv:1912.10891·cs.LG·December 15, 2020

TL;DR

This paper introduces the Soft Q Network (SQN), integrating entropy regularization into Deep Q Networks to improve exploration, stability, and efficiency, demonstrated through experiments on the Google Research Football environment.

Contribution

The paper proposes SQN with entropy regularization, revealing the connection between soft Q learning and policy improvement, and introduces an on-policy deep Q learning algorithm, QOP.

Findings

01. QOP shows high stability in training GRF agents
02. SQN benefits from entropy regularization for better exploration
03. QOP outperforms traditional methods in the GRF environment

Abstract

Deep Q Network (DQN) is a very successful algorithm, yet the inherent problem of reinforcement learning, i.e. the exploit-explore balance, remains. In this work, we introduce entropy regularization into DQN and propose SQN. We find that the backup equation of soft Q learning can enjoy corrective feedback if we view the soft backup as policy improvement in the form of Q, instead of policy evaluation. We show that Soft Q Learning with Corrective Feedback (SQL-CF) underlies the on-policy nature of SQL and the equivalence of SQL and Soft Policy Gradient (SPG). With these insights, we propose an on-policy version of the deep Q learning algorithm, i.e. Q On-Policy (QOP). We experiment with QOP on a self-play environment called Google Research Football (GRF). The QOP algorithm exhibits great stability and efficiency in training GRF agents.
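To make the idea of entropy regularization in a Q-learning backup concrete, here is a minimal sketch of the standard soft Q-learning quantities. The abstract does not state its equations, so this follows the commonly used formulation and may not match the paper's exact form: soft value V(s) = alpha * log sum_a exp(Q(s, a) / alpha), Boltzmann policy pi(a|s) = exp((Q(s, a) - V(s)) / alpha), and target r + gamma * V(s'). The function names and the parameters `alpha` (temperature) and `gamma` (discount) are illustrative assumptions, not identifiers from the paper.

```python
import math


def soft_value(q_values, alpha):
    """Soft state value alpha * log(sum_a exp(Q(s, a) / alpha)).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def boltzmann_policy(q_values, alpha):
    """Entropy-regularized policy pi(a|s) = exp((Q(s, a) - V(s)) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]


def soft_q_target(reward, next_q_values, alpha, gamma, done):
    """Soft backup target r + gamma * V(s'); no bootstrapping on terminal steps."""
    if done:
        return reward
    return reward + gamma * soft_value(next_q_values, alpha)


# Hypothetical Q-values for a three-action state (not data from the paper).
q_next = [1.0, 2.0, 3.0]
probs = boltzmann_policy(q_next, alpha=0.5)
target = soft_q_target(reward=0.1, next_q_values=q_next, alpha=0.5, gamma=0.99, done=False)
```

As `alpha` approaches zero the soft value approaches the maximum Q-value, so the target reduces to the usual DQN target; larger `alpha` spreads probability over more actions, which is the exploration effect the summary attributes to entropy regularization.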

Taxonomy

Topics: Data Stream Mining Techniques · Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research

Methods: Q-Learning · Entropy Regularization · Dense Connections · Convolution · Deep Q-Network