EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning
Jianfei Ma, Wee Sun Lee

TL;DR
EUBRL is a Bayesian reinforcement learning algorithm that uses epistemic uncertainty to guide exploration. It comes with nearly minimax-optimal regret and sample complexity guarantees and is especially effective in sparse-reward, long-horizon environments.
Contribution
The paper introduces EUBRL, a novel Bayesian RL method that adaptively guides exploration using epistemic uncertainty, with theoretical guarantees and empirical validation.
Findings
EUBRL achieves superior sample efficiency in complex tasks.
It demonstrates scalability and consistency across various environments.
Theoretical analysis shows near-minimax optimal regret bounds.
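As a rough guide to what "near-minimax optimal" means here, the regret bound quoted in the reviews on this page, $\tilde{O}(\sqrt{SAT}/(1-\gamma)^{1.5} + S^2 A/(1-\gamma)^2)$, has two terms, and the first one dominates once $T$ is large. The back-of-the-envelope check below ignores constants and logarithmic factors (an assumption made for illustration, not a statement from the paper):

$$
\frac{\sqrt{SAT}}{(1-\gamma)^{1.5}} \ge \frac{S^2 A}{(1-\gamma)^{2}}
\iff \sqrt{SAT} \ge \frac{S^2 A}{(1-\gamma)^{0.5}}
\iff T \ge \frac{S^3 A}{1-\gamma}.
$$

For instance, with $S = 10$, $A = 4$, and $\gamma = 0.9$, the leading term takes over at roughly $T \ge 10^3 \cdot 4 / 0.1 = 40{,}000$ steps.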
Abstract
At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects such boundaries, representing systematic uncertainty due to limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, EUBRL, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a class of sufficiently expressive priors in infinite-horizon discounted MDPs. Empirically, we evaluate EUBRL on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that EUBRL achieves superior sample efficiency, scalability, and consistency.
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper tackles the fundamental exploration–exploitation dilemma in reinforcement learning by introducing a novel approach termed “epistemic guidance.” The method is well-motivated, and its effectiveness is supported through both rigorous theoretical analysis and comprehensive empirical evaluation. 2. The paper argues that adding an exploration bonus to the reward estimate is a flawed way to do exploration in BAMDPs because the reward estimate can be highly uncertain and can result in a po
1. The paper's analysis is confined to discrete state-action spaces, and it does not address the challenges of integrating its method with deep function approximation. The reliance on maintaining an explicit Bayesian posterior is computationally intractable for the high-dimensional environments where deep RL is typically applied. Therefore, it is unclear how the 'epistemically guided reward' could be effectively approximated to improve sample efficiency in practical deep RL algorithms. 2. The
The paper is eloquently written and introduces a novel theoretical proof which, for the first time (to the best of my knowledge), achieves nearly minimax-optimal sample complexity in infinite-horizon discounted MDPs without assuming a generative model. This result improves on He et al. 2021, which shows nearly minimax-optimal regret but doesn’t extend to sample complexity. The theoretical results are backed up by convincing empirical results (using multiple seeds, reporting standard errors e
Towards the goal of disentangling exploration and exploitation, the evidence could be strengthened by, e.g., considering an ablation (see questions). The accessibility of the paper to a wider audience could also benefit from adding short intuitive summaries after key lemmas in the appendices.
1. To the best of my knowledge, this is the first work to convert epistemic uncertainty into an explicit guidance weight $P(U=1 \mid s, a)$. This idea is novel, naturally decouples exploration from uncertain reward estimates, and provides an adaptive interpolation weight of the reward signal. 2. Theoretical guarantees are strong. The paper gives (i) a regret bound $\tilde{O}\left(\sqrt{SAT}/(1-\gamma)^{1.5} + S^2 A/(1-\gamma)^2\right)$ that matches known lower bounds when $T$ is large enough, and (ii)
1. The algorithmic description is quite high-level. More concrete details and examples would help reproducibility, for instance: how $\mathcal{E}(s, a)$ is computed in practice; how the on-policy estimate $P(U=1 \mid s, a) = \mathcal{E}_{b}/\mathcal{E}_{\max}$ is formed; and how $\mathcal{E}_{\max}$ is chosen. 2. Some notation and concepts in the main text need clearer definitions. For example: what is the role of $w$ in Section 3.1? How is a "prior" specified precisely (i.e., prior over what objec
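To make the guidance weight discussed in the reviews more concrete, here is a minimal tabular sketch of the ratio $P(U=1 \mid s, a) = \mathcal{E}/\mathcal{E}_{\max}$. Everything beyond that ratio is an assumption for illustration: the uncertainty $\mathcal{E}(s, a)$ is taken to be the total posterior variance of a symmetric Dirichlet over next states, $\mathcal{E}_{\max}$ is taken to be the uncertainty under the prior alone, and the interpolation in `guided_reward` (with a hypothetical exploration signal `r_explore`) is one plausible reading of "adaptive interpolation weight", not the paper's definition.

```python
import numpy as np


def dirichlet_epistemic_uncertainty(counts, prior=1.0):
    """Total posterior variance of next-state probabilities for each (s, a).

    counts: array of shape (S, A, S) holding observed transition counts.
    Assumption: symmetric Dirichlet(prior) over each transition row; the
    paper's uncertainty measure may be defined differently.
    """
    alpha = counts + prior                       # posterior Dirichlet parameters
    alpha0 = alpha.sum(axis=-1, keepdims=True)   # total concentration per (s, a)
    mean = alpha / alpha0
    var = mean * (1.0 - mean) / (alpha0 + 1.0)   # Dirichlet marginal variances
    return var.sum(axis=-1)                      # shape (S, A)


def guidance_weight(uncertainty, e_max):
    """Ratio E(s, a) / E_max, clipped to [0, 1] so it can act as a probability."""
    return np.clip(uncertainty / e_max, 0.0, 1.0)


def guided_reward(r_hat, weight, r_explore):
    """Hypothetical interpolation: trust r_hat where uncertainty is low,
    fall back to the exploration signal r_explore where it is high."""
    return (1.0 - weight) * r_hat + weight * r_explore


# Toy example: 3 states, 2 actions; only (s=0, a=0) has been visited.
S, A = 3, 2
counts = np.zeros((S, A, S))
counts[0, 0, 1] = 50.0

# Assumed choice of E_max: the uncertainty before any data is observed.
e_max = dirichlet_epistemic_uncertainty(np.zeros((S, A, S))).max()
weights = guidance_weight(dirichlet_epistemic_uncertainty(counts), e_max)

r_hat = np.zeros((S, A))                         # placeholder reward estimates
r_tilde = guided_reward(r_hat, weights, r_explore=1.0)
```

In this toy setup the unvisited pairs keep a weight of 1, so their guided reward equals the exploration signal, while the well-visited pair `(0, 0)` has a weight close to 0 and is driven almost entirely by its reward estimate.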
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning
