Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

Carles Gelada; Marc G. Bellemare

arXiv:1901.09455 · cs.LG · January 29, 2019 · 6 cites

TL;DR

This paper extends COP-TD, an off-policy correction method for reinforcement learning, with a discount factor and a soft normalization penalty. The discounted variant is better behaved in theory, and the authors report empirical gains on Atari games.

Contribution

The paper extends COP-TD (Hallak et al., 2017) with a discount factor and a soft normalization penalty, so that the method can be used with nonlinear function approximation in off-policy RL.

Findings

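01. Discounted COP-TD is better behaved from a theoretical perspective than the undiscounted operator, which is not known to be a contraction mapping.

02. A soft normalization penalty replaces the projection step onto the probability simplex.

03. The proposed methods yield empirical gains on Atari games.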

Abstract

In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.'s solution is appealing, it cannot easily be transferred to nonlinear function approximation. First, it requires a projection step onto the probability simplex; second, even though the operator describing the expected behavior of the off-policy learning algorithm is convergent, it is not known to be a contraction mapping, and hence, may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be…
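To make the role of the discount factor concrete, here is a minimal tabular sketch of a discounted ratio operator. It is an illustration under stated assumptions, not the authors' implementation. The operator form `gamma * (Y c) + (1 - gamma)` is assumed here and does not appear in the abstract text above. The two 3-state transition matrices are hypothetical, and the helper names `stationary`, `discounted_cop`, and `solve_ratios` are invented for the example. `Y` denotes the undiscounted operator whose fixed point is the ratio between the target and behaviour stationary distributions.

```python
def stationary(P, iters=5000):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[x] * P[x][y] for x in range(n)) for y in range(n)]
    return d


def discounted_cop(c, P_pi, d_mu, gamma):
    """Apply the assumed operator: gamma * (Y c) + (1 - gamma) * 1."""
    n = len(c)
    out = []
    for y in range(n):
        # (Y c)(y) = sum_x d_mu(x) * P_pi(y | x) * c(x) / d_mu(y)
        yc = sum(d_mu[x] * P_pi[x][y] * c[x] for x in range(n)) / d_mu[y]
        out.append(gamma * yc + (1.0 - gamma))
    return out


def solve_ratios(P_pi, d_mu, gamma, iters=2000):
    """Iterate the operator from the all-ones vector until it settles."""
    c = [1.0] * len(d_mu)
    for _ in range(iters):
        c = discounted_cop(c, P_pi, d_mu, gamma)
    return c


# Hypothetical state-to-state chains induced by a behaviour policy (mu)
# and a target policy (pi); the numbers are made up for illustration.
P_mu = [[0.5, 0.4, 0.1],
        [0.3, 0.4, 0.3],
        [0.2, 0.3, 0.5]]
P_pi = [[0.1, 0.6, 0.3],
        [0.1, 0.3, 0.6],
        [0.2, 0.2, 0.6]]

d_mu = stationary(P_mu)
d_pi = stationary(P_pi)
c_disc = solve_ratios(P_pi, d_mu, gamma=0.9)  # discounted ratios
c_full = solve_ratios(P_pi, d_mu, gamma=1.0)  # undiscounted ratios

print("discounted  :", c_disc)
print("undiscounted:", c_full)
print("d_pi / d_mu :", [p / m for p, m in zip(d_pi, d_mu)])
```

In this sketch, `gamma=1.0` recovers the ratio `d_pi / d_mu` for the toy chain, while `gamma=0.9` yields a different fixed point whose `d_mu`-weighted mean is still 1, so the iterates stay normalized without an explicit projection. The `(1 - gamma)` term pulls every entry toward 1 at each step, which is the intuition behind treating the discounted operator as the easier one to analyse.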

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control