Bayesian Distributional Policy Gradients
Luchen Li, A. Aldo Faisal

TL;DR
This paper introduces BDPG, a distributional RL algorithm that models state-return distributions rather than state-action-return distributions; experiments on Atari and MuJoCo benchmarks show better exploration and faster learning.
Contribution
BDPG models state-return distributions and uses adversarial training to estimate return uncertainty, integrating curiosity-driven exploration into distributional RL.
Findings
BDPG learns faster than existing algorithms.
Achieves higher asymptotic performance.
Effective in hard-exploration tasks.
Abstract
Distributional Reinforcement Learning (RL) maintains the entire probability distribution of the reward-to-go, i.e., the return. This provides more learning signals that account for the uncertainty associated with policy performance, which may be beneficial for trading off exploration and exploitation and for policy learning in general. Previous works in distributional RL focused mainly on computing state-action-return distributions; here we model state-return distributions. This enables us to translate successful conventional RL algorithms that are based on state values into distributional RL. We formulate the distributional Bellman operation as an inference-based auto-encoding process that minimises Wasserstein metrics between target and model return distributions. The proposed algorithm, BDPG (Bayesian Distributional Policy Gradients), uses adversarial training in joint-contrastive…
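The abstract rests on two ingredients: a distributional Bellman target for state returns, Z(s) =_D R + γ Z(S'), and a Wasserstein metric between target and model return distributions. The following is a minimal sketch of only those two ingredients on sample sets. It is not BDPG itself, which according to the abstract uses an inference-based auto-encoding process and adversarial training. The function names state_return_targets and wasserstein1 are hypothetical, and the closed form used for W1 (the mean absolute difference of sorted samples) holds only for two equal-size, uniformly weighted empirical distributions.

```python
from typing import List


def state_return_targets(reward: float, next_samples: List[float],
                         gamma: float, done: bool = False) -> List[float]:
    """Hypothetical helper: sample-based distributional Bellman target for a
    state return Z(s). Each target sample is r + gamma * z', with z' drawn
    from the model's Z(s'); at a terminal transition the target is just r."""
    return [reward + (0.0 if done else gamma * z) for z in next_samples]


def wasserstein1(xs: List[float], ys: List[float]) -> float:
    """Hypothetical helper: 1-Wasserstein distance between two empirical
    distributions with the same number of equally weighted samples.
    In 1-D this equals the mean absolute gap between sorted samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("expects two non-empty sample sets of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


if __name__ == "__main__":
    model_samples = [0.0, 1.0, 2.0]    # assumed samples from the model's Z(s)
    next_samples = [1.0, 2.0, 3.0]     # assumed samples from the model's Z(s')
    targets = state_return_targets(1.0, next_samples, gamma=0.5)
    print(targets)                               # [1.5, 2.0, 2.5]
    print(wasserstein1(model_samples, targets))  # 1.0
```

Working with state returns means the target depends only on s', not on an action at s', which is what lets state-value-based algorithms carry over to the distributional setting as the abstract describes.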
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Advanced Bandit Algorithms Research
