EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning

Jianfei Ma; Wee Sun Lee

arXiv:2512.15405·cs.LG·March 3, 2026

Open Access · 3 Reviews

TL;DR

EUBRL is a Bayesian reinforcement learning algorithm that uses epistemic uncertainty to guide exploration. It comes with nearly minimax-optimal regret and sample complexity guarantees and is evaluated on tasks with sparse rewards, long horizons, and stochasticity.

Contribution

The paper introduces EUBRL, a novel Bayesian RL method that adaptively guides exploration using epistemic uncertainty, with theoretical guarantees and empirical validation.

Findings

01

EUBRL achieves superior sample efficiency in complex tasks.

02

It demonstrates scalability and consistency across various environments.

03

Theoretical analysis shows near-minimax optimal regret bounds.

Abstract

At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects such boundaries, representing systematic uncertainty due to limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, $EUBRL$, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a class of sufficiently expressive priors in infinite-horizon discounted MDPs. Empirically, we evaluate $EUBRL$ on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that $EUBRL$ achieves superior sample efficiency, scalability, and consistency.

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper tackles the fundamental exploration–exploitation dilemma in reinforcement learning by introducing a novel approach termed “epistemic guidance.” The method is well-motivated, and its effectiveness is supported through both rigorous theoretical analysis and comprehensive empirical evaluation.

2. The paper argues that adding an exploration bonus to the reward estimate is a flawed way to do exploration in BAMDPs because the reward estimate can be highly uncertain and can result in a po

Weaknesses

1. The paper's analysis is confined to discrete state-action spaces, and it does not address the challenges of integrating its method with deep function approximation. The reliance on maintaining an explicit Bayesian posterior is computationally intractable for the high-dimensional environments where deep RL is typically applied. Therefore, it is unclear how the 'epistemically guided reward' could be effectively approximated to improve sample efficiency in practical deep RL algorithms.

2. The

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The paper is eloquently written and introduces a novel theoretical proof which, for the first time (to the best of my knowledge), achieves nearly minimax-optimal sample complexity in infinite-horizon discounted MDPs without assuming a generative model. This result improves on He et al. (2021), which shows nearly minimax-optimal regret but doesn’t extend to sample complexity. The theoretical results are backed up by convincing empirical results (using multiple seeds, reporting standard errors e

Weaknesses

Towards the goal of disentangling exploration and exploitation, the evidence could be strengthened by, e.g., considering an ablation (see questions). The accessibility of the paper to a wider audience could also benefit from adding short intuitive summaries after key lemmas in the appendices.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. To the best of my knowledge, this is the first work to convert epistemic uncertainty into an explicit guidance weight $P(U=1 \mid s, a)$. This idea is novel, naturally decouples exploration from uncertain reward estimates, and provides an adaptive interpolation weight of the reward signal.

2. Theoretical guarantees are strong. The paper gives (i) a regret bound $\tilde{O}\left(\sqrt{SAT}/(1-\gamma)^{1.5} + S^2 A/(1-\gamma)^2\right)$ that matches known lower bounds when $T$ is large enough, and (ii)
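To make "when $T$ is large enough" concrete, the two terms of the regret bound quoted above can be compared directly. Ignoring the logarithmic factors hidden by $\tilde{O}$ (a simplifying assumption of this note, not a statement taken from the paper), the $\sqrt{SAT}$ term dominates once:

$$
\frac{\sqrt{SAT}}{(1-\gamma)^{1.5}} \ge \frac{S^2 A}{(1-\gamma)^{2}}
\iff \sqrt{SAT} \ge \frac{S^2 A}{(1-\gamma)^{0.5}}
\iff SAT \ge \frac{S^4 A^2}{1-\gamma}
\iff T \ge \frac{S^3 A}{1-\gamma}.
$$

Under that reading, the bound behaves like $\sqrt{SAT}/(1-\gamma)^{1.5}$ for horizons of order $S^3 A/(1-\gamma)$ and beyond, up to logarithmic factors.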

Weaknesses

1. The algorithmic description is quite high-level. More concrete details and examples would help reproducibility, for instance: how $\mathcal{E}(s, a)$ is computed in practice; how the on-policy estimate $P(U=1 \mid s, a) = \mathcal{E}_{b} / \mathcal{E}_{\max}$ is formed; and how $\mathcal{E}_{\max}$ is chosen.

2. Some notation and concepts in the main text need clearer definitions. For example: what is the role of $w$ in Section 3.1? How is a "prior" specified precisely (i.e., prior over what objec
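As a rough illustration of the normalization the reviewer asks about, the sketch below forms a guidance weight as the ratio of an uncertainty value to its maximum. Everything beyond that ratio is an assumption made here for illustration and is not taken from the paper: the count-based proxy `1 / (n + prior_count)` for $\mathcal{E}(s, a)$, the choice of $\mathcal{E}_{\max}$ as the proxy's value at zero visits, the convex interpolation in `guided_reward`, and the names `guidance_weight`, `guided_reward`, `prior_count`, and `r_explore`.

```python
import numpy as np


def guidance_weight(counts, prior_count=1.0):
    """Return an illustrative P(U=1 | s, a) = E(s, a) / E_max per state-action pair.

    ASSUMPTION: E(s, a) is a count-based proxy, 1 / (n(s, a) + prior_count).
    The paper's actual definition of epistemic uncertainty may differ.
    """
    counts = np.asarray(counts, dtype=float)
    eps = 1.0 / (counts + prior_count)   # hypothetical uncertainty proxy
    eps_max = 1.0 / prior_count          # proxy value for an unvisited pair
    return eps / eps_max                 # lies in (0, 1], equals 1 at zero visits


def guided_reward(r_hat, p_unknown, r_explore=1.0):
    """Interpolate between the estimated reward and an exploration target.

    ASSUMPTION: a plain convex combination weighted by p_unknown; this is one
    possible reading of "adaptive interpolation weight", not the paper's formula.
    """
    r_hat = np.asarray(r_hat, dtype=float)
    p_unknown = np.asarray(p_unknown, dtype=float)
    return (1.0 - p_unknown) * r_hat + p_unknown * r_explore


if __name__ == "__main__":
    visits = [0, 1, 3]                   # visit counts for three (s, a) pairs
    weights = guidance_weight(visits)
    print(weights)
    print(guided_reward([0.2, 0.2, 0.2], weights))
```

With this proxy the weight shrinks as a state-action pair is visited more often, so the interpolated signal moves from the exploration target toward the estimated reward.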

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning