Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning

Abdul Wahab; Raksha Kumaraswamy; Martha White

arXiv:2602.12375 · cs.LG · February 16, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces VBE, an exploration algorithm for reinforcement learning that uses ensemble errors to construct value bonuses. The bonuses provide first-visit optimism and deep exploration, and VBE outperforms Bootstrap DQN, RND, and ACB in classic exploration environments.

Contribution

The paper proposes VBE, a novel exploration method leveraging ensemble errors to generate value bonuses that encourage first-visit optimism in RL agents.

Findings

01

VBE outperforms Bootstrap DQN, RND, and ACB in classic exploration environments.

02

VBE scales effectively to complex environments like Atari.

03

The method provides deep exploration by decreasing value bonuses to zero after initial visits.

Abstract

Optimistic value estimates provide one mechanism for directed exploration in reinforcement learning (RL). The agent acts greedily with respect to an estimate of the value plus what can be seen as a value bonus. The value bonus can be learned by estimating a value function on reward bonuses, propagating local uncertainties around rewards. However, this approach only increases the value bonus for an action retroactively, after seeing a higher reward bonus from that state and action. Such an approach does not encourage the agent to visit a state and action for the first time. In this work, we introduce an algorithm for exploration called Value Bonuses with Ensemble errors (VBE), that maintains an ensemble of random action-value functions (RQFs). VBE uses the errors in the estimation of these RQFs to design value bonuses that provide first-visit optimism and deep exploration. The key idea…
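The mechanism described in the abstract can be illustrated with a small tabular sketch. This is a deliberately simplified illustration, not the authors' algorithm: it assumes the estimates are regressed directly toward fixed random targets on each visit, whereas the paper learns the RQFs from rewards. All names here (`make_ensemble`, `value_bonus`, `update_on_visit`, `act`) and the step size are hypothetical.

```python
import numpy as np

def make_ensemble(n_members, n_states, n_actions, seed=0):
    """Fixed random target tables plus zero-initialized estimates (assumed setup)."""
    rng = np.random.default_rng(seed)
    targets = rng.uniform(-1.0, 1.0, size=(n_members, n_states, n_actions))
    estimates = np.zeros_like(targets)
    return targets, estimates

def value_bonus(targets, estimates):
    """Bonus per (state, action): largest absolute estimation error in the ensemble."""
    return np.abs(targets - estimates).max(axis=0)

def update_on_visit(targets, estimates, s, a, step_size=0.5):
    """Assumption: move every member's estimate toward its target at the visited pair."""
    estimates[:, s, a] += step_size * (targets[:, s, a] - estimates[:, s, a])

def act(q_values, bonus, s):
    """Greedy action with respect to the value estimate plus the value bonus."""
    return int(np.argmax(q_values[s] + bonus[s]))

# Usage: visit (s=0, a=1) repeatedly and compare bonuses before and after.
targets, estimates = make_ensemble(n_members=4, n_states=3, n_actions=2)
before = value_bonus(targets, estimates)
for _ in range(20):
    update_on_visit(targets, estimates, s=0, a=1)
after = value_bonus(targets, estimates)
```

In this sketch the bonus for the visited pair shrinks toward zero, while pairs that were never visited keep their initial bonus. That gap is what makes an untried action look attractive before any reward bonus has been observed there.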

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

The paper is well written, with clear motivation and a discussion of the relationship between VBE and BDQN. The proposed algorithm is novel to me and is interesting. The experiments also show that the proposed VBE performs better than SOTA algorithms.

Weaknesses

Regarding the claim that the proposed bonus “ensures that bonus goes to zero” when the environment is sufficiently explored: in other UCB-style work and in the BDQN setup, the bonus will theoretically also go to zero if actions are sufficiently explored. Overall I find this work interesting, but the contribution is relatively marginal, given existing algorithms including BDQN, RND, ICM [1], numerous self-supervised exploration methods of this style (e.g., [1][2], to name a few), and numerous theoretical anal

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

- Originality: The paper attempts to address exploration in reinforcement learning by introducing the Value Bonuses with Ensemble Errors (VBE). The use of random action-value functions (RQFs) to determine consistent rewards represents a departure from conventional ensemble-based methods in deep reinforcement learning.
- Quality: While there are areas in need of further clarity, the paper provides some mathematical formulations, particularly around the stochastic ensemble reward, suggesting an e

Weaknesses

This paper describes a simple idea in a somewhat convoluted manner. Here are specific areas of concern:
- Clarity and Presentation: The paper tends to obfuscate what could be explained more simply. While there is value in rigorous mathematical explanations, these should be accompanied by intuitive explanations and clearer definitions for broader accessibility. For example, the distinction between equation 2 and the actual bonus used in algorithm 1 is not clearly demarcated, leading to potentia

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- Exploration bonuses seem underexplored as of late, especially given that RND suffers from only rewarding infrequent states and can collapse, as the authors show (I have personal experience with this as well). If we are interested in settings where behavior data is not available, we will need better exploration methods.
- The method seems quite sample efficient in empirical evaluations, and the authors correctly note that RND takes an extremely large number of samples to converge (2 billion in

Weaknesses

- The experiments are a bit small-scale (only ~400k environment steps at most in the Atari domains).
- There is no experiment in the main text on the larger domains that runs to completion, only the early exploration behavior. While I agree that early exploration behavior is more informative for our understanding, it would be good to have an example of how more frames change behavior in more difficult settings (currently in Appendix Figure 7, which does not show baselines as well).
- There is li

Taxonomy

Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning