Exploration with Multi-Sample Target Values for Distributional Reinforcement Learning
Michael Teng, Michiel van de Panne, Frank Wood

TL;DR
This paper introduces multi-sample target values for distributional reinforcement learning, improving distribution estimates and exploration, leading to state-of-the-art results in continuous control tasks.
Contribution
The paper proposes multi-sample target values (MTV) and UCB-based exploration for distributional RL, and combines them in a new algorithm, E2DC (Extra Exploration with Distributional Critics).
Findings
Achieves state-of-the-art results on Humanoid control
Demonstrates improved distribution estimates during training
Provides insights through visualization of learned distributions
Abstract
Distributional reinforcement learning (RL) aims to learn a value network that predicts the full distribution of returns for a given state, often modeled via a quantile-based critic. This approach has been successfully integrated into common RL methods for continuous control, giving rise to algorithms such as Distributional Soft Actor-Critic (DSAC). In this paper, we introduce multi-sample target values (MTV) for distributional RL as a principled replacement for the single-sample target value estimation commonly employed in current practice. The improved distributional estimates further lend themselves to UCB-based exploration. These two ideas are combined to yield our distributional RL algorithm, E2DC (Extra Exploration with Distributional Critics). We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult…
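The two ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the multiple samples are combined or how the UCB bonus is computed. The sketch assumes that "multi-sample" means pooling the quantile estimates from K sampled next actions into one set of target atoms, and that the UCB score is the mean of the quantiles plus beta times their standard deviation. The function names, the `beta` parameter, and the array shapes are all hypothetical.

```python
import numpy as np

def multi_sample_targets(reward, gamma, next_quantiles):
    """Pool target atoms from K sampled next actions (assumed construction).

    next_quantiles has shape (K, N): N quantile estimates of the return for
    each of K next actions. K = 1 corresponds to a single-sample target.
    Returns a flat array of K * N target atoms.
    """
    next_quantiles = np.asarray(next_quantiles, dtype=float)
    return reward + gamma * next_quantiles.reshape(-1)

def ucb_score(quantiles, beta):
    """Optimistic score: mean of the quantiles plus beta times their spread.

    This mean-plus-standard-deviation form is an assumption, not taken
    from the paper.
    """
    q = np.asarray(quantiles, dtype=float)
    return q.mean(axis=-1) + beta * q.std(axis=-1)

def select_ucb_action(candidate_quantiles, beta=1.0):
    """Return the index of the candidate action with the highest UCB score."""
    return int(np.argmax(ucb_score(candidate_quantiles, beta)))

# Two sampled next actions, two quantiles each, pooled into four atoms.
targets = multi_sample_targets(1.0, 0.5, [[0.0, 2.0], [4.0, 6.0]])

# Both candidates have the same mean return; the second is more uncertain.
candidates = [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]
chosen = select_ucb_action(candidates, beta=1.0)
```

With `beta = 0` the score reduces to the mean return, so the exploration bonus only matters when candidates differ in the spread of their predicted return distribution.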
Taxonomy
Topics: Reinforcement Learning in Robotics
