Learning Expected Emphatic Traces for Deep RL

Ray Jiang; Shangtong Zhang; Veronica Chelu; Adam White; Hado van Hasselt

arXiv:2107.05405·cs.LG·July 13, 2021·1 cites

TL;DR

This paper introduces a novel method for combining emphatic weightings with off-line replay data in deep reinforcement learning, improving stability and scalability in off-policy training.

Contribution

It develops a multi-step emphatic weighting technique compatible with replay buffers and a time-reversed TD learning algorithm to estimate these weightings.
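
To make the idea of learning an emphatic weighting from replayed transitions concrete, here is a minimal tabular sketch. It assumes the standard one-step follow-on trace recursion from emphatic TD, F_t = gamma * rho_{t-1} * F_{t-1} + i_t, with a constant interest i_t = 1. The function names `follow_on_trace` and `reversed_td_update` are hypothetical, and the sketch does not reproduce the paper's multi-step weighting or its $n$-step algorithm.

```python
def follow_on_trace(rhos, gamma, interest=1.0):
    """Sequential follow-on trace along one trajectory (assumed recursion).

    rhos[t] is the importance-sampling ratio of the action taken at step t.
    F_0 = interest, and F_t = gamma * rhos[t - 1] * F_{t-1} + interest.
    Computing F_t this way requires visiting the trajectory in order.
    """
    traces = []
    f = 0.0
    for t in range(len(rhos)):
        f = interest if t == 0 else gamma * rhos[t - 1] * f + interest
        traces.append(f)
    return traces


def reversed_td_update(f_table, s_prev, s, rho_prev, gamma, alpha, interest=1.0):
    """One time-reversed TD step on a single transition (s_prev -> s).

    The estimate at the current state s bootstraps from the estimate at the
    previous state s_prev, so the transition can be sampled in any order,
    for example from a replay buffer. f_table is a hypothetical tabular
    estimate of the expected trace per state.
    """
    target = gamma * rho_prev * f_table[s_prev] + interest
    f_table[s] += alpha * (target - f_table[s])
    return f_table


# Sequential trace on a three-step trajectory.
traces = follow_on_trace([1.0, 2.0, 0.5], gamma=0.5)

# A single replayed transition from state 0 to state 1.
table = reversed_td_update([1.0, 0.0], s_prev=0, s=1, rho_prev=1.0,
                           gamma=0.5, alpha=0.5)
```

The design point is the direction of bootstrapping: ordinary TD for values bootstraps from the next state, whereas this update bootstraps from the previous one, which is why a learned estimate can stand in for a trace that would otherwise need in-order trajectories.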

Findings

01 Reduced variance in emphatic weightings compared to prior methods

02 Convergence guarantees for the proposed approach

03 Improved performance of Atari agents using the new method

Abstract

Off-policy sampling and experience replay are key for improving sample efficiency and scaling model-free temporal difference learning methods. When combined with function approximation, such as neural networks, this combination is known as the deadly triad and is potentially unstable. Recently, it has been shown that stability and good performance at scale can be achieved by combining emphatic weightings and multi-step updates. This approach, however, is generally limited to sampling complete trajectories in order, to compute the required emphatic weighting. In this paper we investigate how to combine emphatic weightings with non-sequential, off-line data sampled from a replay buffer. We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed $n$-step TD learning algorithm to learn the required emphatic weighting. We show that these state weightings…

Taxonomy

Topics: Data Stream Mining Techniques · Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics

Methods: Experience Replay