Leveraging the Variance of Return Sequences for Exploration Policy

Zerong Xi; Gita Sukthankar

arXiv:2011.08649·cs.LG·November 18, 2020

TL;DR

This paper proposes a novel exploration method in reinforcement learning that uses the variance of return sequences and TD errors to guide exploration, demonstrating improved performance on Atari games.

Contribution

It introduces a two-stream network architecture to estimate variance and TD errors for exploration in DQN agents, enhancing exploration efficiency.

Findings

1. Outperforms baseline on multiple Atari games
2. Variance and TD errors effectively guide exploration
3. Two-stream network improves estimation accuracy

Abstract

This paper introduces a method for constructing an upper bound for exploration policy using either the weighted variance of return sequences or the weighted temporal difference (TD) error. We demonstrate that the variance of the return sequence for a specific state-action pair is an important information source that can be leveraged to guide exploration in reinforcement learning. The intuition is that fluctuation in the return sequence indicates greater uncertainty in the near future returns. This divergence occurs because of the cyclic nature of value-based reinforcement learning: the evolving value function begets policy improvements, which in turn modify the value function. Although variance and TD errors capture different aspects of this uncertainty, our analysis shows that both can be valuable to guide exploration. We propose a two-stream network architecture to estimate…
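
The abstract is cut off on this page, and the paper's exact weighting scheme and bound are not given here. As a rough illustration of the general idea only, the tabular sketch below tracks an exponentially weighted mean and variance of the sequence of return estimates for each state-action pair and picks the action with the highest "mean plus scaled standard deviation". The class name `VarianceBonusTable`, the parameters `decay` and `c`, and the choice of exponential weighting are assumptions made for this example; this is not the authors' two-stream network.

```python
import numpy as np


class VarianceBonusTable:
    """Hypothetical tabular sketch: variance of return estimates as an exploration bonus."""

    def __init__(self, n_states, n_actions, decay=0.9, c=1.0):
        self.mean = np.zeros((n_states, n_actions))  # weighted mean of return estimates
        self.var = np.zeros((n_states, n_actions))   # weighted variance of return estimates
        self.decay = decay  # assumed weight given to older estimates
        self.c = c          # assumed scale of the exploration bonus

    def update(self, state, action, ret):
        # Standard exponentially weighted mean/variance recursion.
        alpha = 1.0 - self.decay
        delta = ret - self.mean[state, action]
        self.mean[state, action] += alpha * delta
        self.var[state, action] = self.decay * (
            self.var[state, action] + alpha * delta ** 2
        )

    def upper_bound(self, state):
        # Optimistic value: mean plus a bonus proportional to the standard deviation.
        return self.mean[state] + self.c * np.sqrt(self.var[state])

    def act(self, state):
        # Greedy with respect to the upper bound rather than the mean.
        return int(np.argmax(self.upper_bound(state)))


# Action 0 sees fluctuating return estimates; action 1 sees a constant one.
table = VarianceBonusTable(n_states=1, n_actions=2)
for ret in (1.0, -1.0, 1.0, -1.0):
    table.update(0, 0, ret)
    table.update(0, 1, 0.0)
chosen = table.act(0)
```

In this toy setup the action whose return estimates fluctuate carries the larger variance and therefore the larger bonus, which matches the intuition stated in the abstract that fluctuation signals uncertainty worth exploring.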

Taxonomy

Topics: Reservoir Engineering and Simulation Methods · Reinforcement Learning in Robotics · Distributed and Parallel Computing Systems

Methods: Dense Connections · Convolution · Q-Learning · Deep Q-Network