Leveraging the Variance of Return Sequences for Exploration Policy
Zerong Xi, Gita Sukthankar

TL;DR
This paper proposes an exploration method for reinforcement learning that builds an upper bound for the exploration policy from either the weighted variance of return sequences or the weighted TD error, and reports improved performance on Atari games.
Contribution
It introduces a two-stream network architecture to estimate variance and TD errors for exploration in DQN agents, enhancing exploration efficiency.
Findings
Outperforms baseline on multiple Atari games
Variance and TD errors effectively guide exploration
Two-stream network improves estimation accuracy
Abstract
This paper introduces a method for constructing an upper bound for exploration policy using either the weighted variance of return sequences or the weighted temporal difference (TD) error. We demonstrate that the variance of the return sequence for a specific state-action pair is an important information source that can be leveraged to guide exploration in reinforcement learning. The intuition is that fluctuation in the return sequence indicates greater uncertainty in the near-future returns. This divergence occurs because of the cyclic nature of value-based reinforcement learning; the evolving value function begets policy improvements which in turn modify the value function. Although variance and TD errors capture different aspects of this uncertainty, our analysis shows that both can be valuable to guide exploration. We propose a two-stream network architecture to estimate…
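As a rough illustration of the idea in the abstract, the sketch below adds a variance-based bonus to each action's value estimate and picks the action with the highest resulting upper bound. The abstract is truncated and does not give the weighting scheme or the exact form of the bound, so everything specific here is an assumption: the exponential-decay weights, the square-root bonus, the `beta` scale, and the names `weighted_variance` and `select_action` are all hypothetical, and the per-action return histories stand in for what the paper estimates with a two-stream network.

```python
import numpy as np

def weighted_variance(returns, decay=0.9):
    """Weighted variance of a sequence of return estimates.

    Assumption: exponential-decay weights, so recent estimates count more.
    The paper's actual weighting is not given in the abstract.
    """
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    # The oldest entry gets the smallest weight, the newest gets weight 1.
    weights = decay ** np.arange(n - 1, -1, -1)
    weights /= weights.sum()
    mean = np.sum(weights * returns)
    return float(np.sum(weights * (returns - mean) ** 2))

def select_action(q_values, return_histories, beta=1.0, decay=0.9):
    """Pick the action maximizing Q + beta * sqrt(weighted variance).

    Assumption: a UCB-style additive bonus. `beta` is a hypothetical
    exploration scale, not a parameter taken from the paper.
    """
    bonus = np.array(
        [np.sqrt(weighted_variance(h, decay)) for h in return_histories]
    )
    upper_bound = np.asarray(q_values, dtype=float) + beta * bonus
    return int(np.argmax(upper_bound))

# Action 0 has a slightly higher value estimate but a stable return history;
# action 1 has a fluctuating history, so its upper bound is larger.
q_values = [1.0, 0.9]
histories = [[1.0, 1.0, 1.0], [0.2, 1.5, 0.9]]
chosen = select_action(q_values, histories, beta=1.0)
greedy = select_action(q_values, histories, beta=0.0)
```

With `beta=0.0` the bonus vanishes and the choice reduces to greedy selection on the value estimates. A TD-error variant would, under the same assumptions, replace the return histories with sequences of absolute TD errors.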
Taxonomy
Topics: Reservoir Engineering and Simulation Methods · Reinforcement Learning in Robotics · Distributed and Parallel Computing Systems
Methods: Dense Connections · Convolution · Q-Learning · Deep Q-Network
