TL;DR
This paper introduces ADFQ, a Bayesian off-policy TD method that uses Assumed Density Filtering to update beliefs on Q-values, improving exploration, regularization, and performance in stochastic environments and in environments with large action spaces.
Contribution
The paper presents a novel Bayesian approach to off-policy TD learning, providing a closed-form update for Q-beliefs and extending it with neural networks for enhanced performance.
Findings
Outperforms comparable algorithms on Atari games
Shows significant improvements in stochastic domains
Handles large action spaces effectively
Abstract
While off-policy temporal difference (TD) methods have been widely used in reinforcement learning due to their efficiency and simple implementation, their Bayesian counterparts have not been utilized as frequently. One reason is that the non-linear max operation in the Bellman optimality equation makes it difficult to define conjugate distributions over the value functions. In this paper, we introduce a novel Bayesian approach to off-policy TD methods, called ADFQ, which updates beliefs on state-action values, Q, through an online Bayesian inference method known as Assumed Density Filtering. We formulate an efficient closed-form solution for the value update by approximately estimating analytic parameters of the posterior of the Q-beliefs. Uncertainty measures in the beliefs are not only used in exploration but also provide a natural regularization for the value update considering…
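The idea of keeping a Gaussian belief over each Q-value and re-fitting a Gaussian after every TD step can be illustrated with a minimal sketch. This is not the paper's closed-form ADFQ update. It assumes exactly two next-state actions, independent Gaussian beliefs with positive variance, and a discount factor greater than zero, and it simplifies the inference step by treating the moment-matched TD target as a Gaussian observation that is fused with the prior. All function names are hypothetical.

```python
import math


def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_of_two_gaussians(m1, v1, m2, v2):
    """Mean and variance of max(X1, X2) for independent Gaussians.

    Uses the standard exact-moment formulas for the maximum of two
    independent normal variables. Assumes v1 + v2 > 0.
    """
    theta = math.sqrt(v1 + v2)
    alpha = (m1 - m2) / theta
    mean = m1 * _cdf(alpha) + m2 * _cdf(-alpha) + theta * _pdf(alpha)
    second = ((m1 * m1 + v1) * _cdf(alpha)
              + (m2 * m2 + v2) * _cdf(-alpha)
              + (m1 + m2) * theta * _pdf(alpha))
    return mean, second - mean * mean


def gaussian_td_update(prior_mean, prior_var, reward, gamma, next_beliefs):
    """Simplified Gaussian belief update for Q(s, a) (illustrative only).

    next_beliefs holds (mean, variance) pairs for the two actions in s'.
    The TD target r + gamma * max_b Q(s', b) is moment-matched to a
    Gaussian and then combined with the prior by precision weighting.
    """
    (m1, v1), (m2, v2) = next_beliefs
    max_mean, max_var = max_of_two_gaussians(m1, v1, m2, v2)
    target_mean = reward + gamma * max_mean
    target_var = gamma ** 2 * max_var  # assumes gamma > 0

    # Product of two Gaussians: precisions add, means are precision-weighted.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / target_var)
    post_mean = post_var * (prior_mean / prior_var + target_mean / target_var)
    return post_mean, post_var
```

In this sketch the posterior variance is always smaller than the prior variance, and an uncertain target (large variance) moves the mean less than a confident one. That is one simple way to see how uncertainty in the beliefs can act as a regularizer on the value update.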
Taxonomy
Methods: Q-Learning
