Bayesian Exploration Networks
Mattie Fellows, Brandon Kaplowitz, Christian Schroeder de Witt, and Shimon Whiteson

TL;DR
This paper introduces Bayesian Exploration Networks (BEN), a novel model-free approach that learns true Bayes-optimal policies in reinforcement learning by modeling both aleatoric and epistemic uncertainties, outperforming existing methods.
Contribution
The paper presents the first analysis showing model-free methods can achieve Bayes-optimality and introduces BEN, which uses normalising flows for uncertainty modeling to attain this goal.
Findings
BEN learns true Bayes-optimal policies in complex tasks.
Existing model-free approaches make approximations that can yield arbitrarily Bayes-suboptimal policies.
Empirical results show BEN outperforms current methods in relevant tasks.
Abstract
Bayesian reinforcement learning (RL) offers a principled and elegant approach for sequential decision making under uncertainty. Most notably, Bayesian agents do not face an exploration/exploitation dilemma, a major pathology of frequentist methods. However, theoretical understanding of model-free approaches is lacking. In this paper, we introduce a novel Bayesian model-free formulation and the first analysis showing that model-free approaches can yield Bayes-optimal policies. We show all existing model-free approaches make approximations that yield policies that can be arbitrarily Bayes-suboptimal. As a first step towards model-free Bayes optimality, we introduce the Bayesian exploration network (BEN) which uses normalising flows to model both the aleatoric uncertainty (via density estimation) and epistemic uncertainty (via variational inference) in the Bellman operator. In the limit of…
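The abstract's phrase "aleatoric uncertainty (via density estimation)" can be illustrated with the simplest possible normalising flow. The sketch below is not BEN's architecture: it assumes a single affine flow over a standard normal base, fitted by maximum likelihood to scalar samples, and it leaves out the epistemic (variational inference) part entirely. The function names and the synthetic "noisy Bellman targets" are hypothetical and chosen only for illustration.

```python
import math
import random


def affine_flow_log_density(x, shift, log_scale):
    """Log-density of x under the flow x = shift + exp(log_scale) * z, z ~ N(0, 1)."""
    z = (x - shift) * math.exp(-log_scale)  # inverse transform back to the base variable
    base_log_prob = -0.5 * (z * z + math.log(2.0 * math.pi))
    # Change of variables: log p(x) = log p(z) + log|dz/dx|, and log|dz/dx| = -log_scale.
    return base_log_prob - log_scale


def fit_affine_flow(samples):
    """Maximum-likelihood shift and log-scale (closed form for an affine flow)."""
    n = len(samples)
    shift = sum(samples) / n
    var = sum((s - shift) ** 2 for s in samples) / n
    return shift, 0.5 * math.log(var)


random.seed(0)
# Hypothetical noisy targets (assumption): mean 1.0, standard deviation 0.5.
targets = [random.gauss(1.0, 0.5) for _ in range(5000)]
shift, log_scale = fit_affine_flow(targets)
avg_log_lik = sum(affine_flow_log_density(t, shift, log_scale) for t in targets) / len(targets)
```

A richer flow would replace the affine map with a stack of invertible layers trained by gradient ascent on the same log-likelihood; the change-of-variables term is what carries over.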
Peer Reviews
Decision·ICML 2024 Poster
- Clear communication. The authors present prior work and their own work in a clear and concise manner. The logical flow of the paper is very nice.
- The authors are very clear about the shortcomings of prior model-free BRL methods and how exactly their proposed approach addresses these shortcomings.
- The structure of BENs is not overly complicated. They use well-known building blocks, such as Q-function approximating functions and normalizing flows, to address the need to model uncertainty in
- While the authors do a nice job of reviewing prior literature, the magnitude of the contribution presented here is not clear. I am inclined to say that the importance of the authors' contributions is relatively low, although they are novel. The theoretical results showing the shortcomings of other model-free BRL approaches are arguably their most important contribution, but it's not clear that they are a sufficient contribution in isolation. I view their formulation of BENs as less impactful.
- T
Bayesian RL is an important line of work that provides a solution to the exploration but suffers from computational complexity. Any progress on this front, as a result, should be relevant to a significant portion of ICLR's community. Additionally, as described in the paper, while model-based methods have proliferated, model-free approaches have seen less attention. Hence, this work is an important contribution. An important aspect of this is the rigorous theory to support the rather novel pers
Two key areas for improvement are the clarity of presentation and the experimental section. First, while the theoretical support in the appendix is certainly substantial, I found it rather difficult to follow key parts of the method description. I believe this is, first, because the (general/theoretical) learning objective and its concrete (practical, normalizing flow network approximation) implementation are presented simultaneously. Potentially more important is the fact that the method description
The paper introduces a novel method that models epistemic and aleatoric uncertainties using normalizing flows. This could be a valuable contribution to the field of reinforcement learning.
1. **Mathematical Rigor:** Although the paper provides extensive mathematical analysis, there's a discernible lack of mathematical rigor.
   * _Theorem 1:_ The theorem hinges on Lemma 1. However, Lemma 1's proof is questionable as its final equality doesn't hold true. The MDP parameter, $\phi$, doesn't impact future decisions made by the contextual policy, which depend solely on history.
   * _Theorem 2:_ The theorem states that the posterior of a mis-specified deterministic model concentrat
Taxonomy
Topics: Machine Learning and Algorithms · Adversarial Robustness in Machine Learning · Bayesian Modeling and Causal Inference
