Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift
Carles Gelada, Marc G. Bellemare

TL;DR
This paper extends COP-TD, a method that reweights off-policy value updates to avoid divergence, with a discount factor and a soft normalization penalty. The discounted variant is better behaved theoretically, and the paper reports empirical gains on Atari games.
Contribution
It extends COP-TD with discounting and a soft normalization method, enabling stable nonlinear function approximation in off-policy RL.
Findings
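Discounted COP-TD is better behaved theoretically than the original, whose operator is convergent but not known to be a contraction mapping.
A soft normalization penalty is proposed as an alternative to the projection step onto the probability simplex.
The proposed methods yield empirical gains on Atari games.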
Abstract
In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.'s solution is appealing, it cannot easily be transferred to nonlinear function approximation. First, it requires a projection step onto the probability simplex; second, even though the operator describing the expected behavior of the off-policy learning algorithm is convergent, it is not known to be a contraction mapping, and hence, may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be…
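To make the role of the discount factor concrete, here is a minimal tabular sketch in plain Python. It is not the authors' implementation. The abstract does not spell out the operator, so the form used here, c <- gamma * D_mu^{-1} P_pi^T D_mu c + (1 - gamma), is an assumption, as are the two 3-state transition matrices and every function name. The sketch iterates this map on the ratio c between the target and behavior state distributions and shows that, with gamma < 1, repeated application settles on a fixed point.

```python
def mat_vec_T(P, v):
    """Return P^T v for a row-stochastic matrix P given as nested lists."""
    n = len(P)
    return [sum(P[i][j] * v[i] for i in range(n)) for j in range(n)]


def stationary(P, iters=2000):
    """Power iteration for the stationary distribution of P (assumes ergodicity)."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = mat_vec_T(P, d)
    return d


def discounted_cop_operator(c, P_pi, d_mu, gamma):
    """One application of the assumed discounted map on the ratio c.

    Computes gamma * D_mu^{-1} P_pi^T D_mu c + (1 - gamma), elementwise per state.
    """
    weighted = [d * ci for d, ci in zip(d_mu, c)]
    pushed = mat_vec_T(P_pi, weighted)
    return [gamma * p / d + (1.0 - gamma) for p, d in zip(pushed, d_mu)]


# Hypothetical state-transition matrices induced by the behavior policy (mu)
# and the target policy (pi); these numbers are illustrative only.
P_mu = [[0.50, 0.50, 0.00],
        [0.25, 0.50, 0.25],
        [0.00, 0.50, 0.50]]
P_pi = [[0.1, 0.9, 0.0],
        [0.1, 0.1, 0.8],
        [0.1, 0.1, 0.8]]
gamma = 0.9

d_mu = stationary(P_mu)
c = [1.0] * 3  # start from "no correction"
for _ in range(500):
    c = discounted_cop_operator(c, P_pi, d_mu, gamma)

# Reweighted state distribution implied by the learned ratio.
d_hat = [d * ci for d, ci in zip(d_mu, c)]
```

In this exact tabular setting the iterate stays normalized under d_mu at every step, so no projection is needed. With sampled updates and function approximation that guarantee no longer holds, which is the setting the abstract's normalization discussion concerns.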
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control
