Universal Value-Function Uncertainties

Moritz A. Zanger; Max Weltevrede; Yaniv Oren; Pascal R. Van der Vaart; Caroline Horsch; Wendelin Böhmer; Matthijs T. J. Spaan

arXiv:2505.21119·cs.LG·June 3, 2025

Open Access·3 Reviews

TL;DR

This paper introduces UVU, a novel method for estimating value-function uncertainty in reinforcement learning that is computationally efficient and theoretically grounded, performing comparably to ensembles in offline RL tasks.

Contribution

UVU provides a simple, theoretically justified approach to quantify value uncertainty using prediction errors against a fixed network, reducing computational costs compared to ensembles.

Findings

1. UVU errors are equivalent to ensemble variance in the infinite-width limit.
2. UVU achieves similar performance to large ensembles in offline RL tasks.
3. UVU offers a computationally efficient alternative to ensemble methods.

Abstract

Estimating epistemic uncertainty in value functions is a crucial challenge for many aspects of reinforcement learning (RL), including efficient exploration, safe decision-making, and offline RL. While deep ensembles provide a robust method for quantifying value uncertainty, they come with significant computational overhead. Single-model methods, while computationally favorable, often rely on heuristics and typically require additional propagation mechanisms for myopic uncertainty estimates. In this work we introduce universal value-function uncertainties (UVU), which, similar in spirit to random network distillation (RND), quantify uncertainty as squared prediction errors between an online learner and a fixed, randomly initialized target network. Unlike RND, UVU errors reflect policy-conditional value uncertainty, incorporating the future uncertainties any given policy may encounter.…
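The building block the abstract shares with RND, a squared prediction error between an online learner and a fixed, randomly initialized target network, can be sketched roughly as follows. This is not the UVU training procedure: the policy-conditional part is cut off in the abstract above and is not reproduced here. The helper names (`make_random_net`, `fit_online_learner`, `uncertainty`) are hypothetical, and ridge regression on random features is an assumed stand-in for a gradient-trained online network.

```python
import numpy as np

def make_random_net(rng, in_dim, hidden=64):
    """Hypothetical helper: a fixed, randomly initialized two-layer tanh network."""
    w1 = rng.normal(size=(in_dim, hidden)) / np.sqrt(in_dim)
    w2 = rng.normal(size=(hidden,)) / np.sqrt(hidden)
    return lambda x: np.tanh(x @ w1) @ w2

def fit_online_learner(x, y, rng, hidden=64, ridge=1e-3):
    """Assumed stand-in for the online learner: random tanh features + ridge regression,
    fitted only on the inputs that were actually visited."""
    w1 = rng.normal(size=(x.shape[1], hidden)) / np.sqrt(x.shape[1])
    feats = np.tanh(x @ w1)
    w2 = np.linalg.solve(feats.T @ feats + ridge * np.eye(hidden), feats.T @ y)
    return lambda q: np.tanh(q @ w1) @ w2

rng = np.random.default_rng(0)
target = make_random_net(rng, in_dim=2)              # never updated after initialization
visited = rng.uniform(-1.0, 1.0, size=(500, 2))      # inputs covered by the data
online = fit_online_learner(visited, target(visited), rng)

def uncertainty(x):
    """Squared prediction error between the online learner and the fixed target."""
    return (online(x) - target(x)) ** 2

seen = rng.uniform(-1.0, 1.0, size=(200, 2))         # same region as the training data
unseen = rng.uniform(3.0, 4.0, size=(200, 2))        # region the learner never saw
print("mean error on seen region:  ", uncertainty(seen).mean())
print("mean error on unseen region:", uncertainty(unseen).mean())
```

In this toy setting the error should be small where the online learner has data and larger where it has none, which is the sense in which the prediction error acts as an uncertainty signal.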

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

1. Improving the computational efficiency of epistemic uncertainty quantification is a key objective for RL algorithms.
2. The paper provides a formal proof, grounded in Neural Tangent Kernel (NTK) theory, that its single-network uncertainty estimate is mathematically equivalent to the variance of an infinite ensemble, moving it beyond simple heuristic methods like Random Network Distillation (RND).
3. The policy-conditioned nature of the value function enables UVU to capture long-term, plan

Weaknesses

**Lack of empirical results**

A potential weakness is the paper's limited evaluation of the intrinsic quality of the uncertainty estimates. The experiments focus on a binary task-rejection benchmark, which demonstrates the utility of the uncertainty but doesn't thoroughly analyze its calibration, its correlation with true value error across different states, or how it behaves in simpler, more controlled environments.

1. The paper compares its approach to Random Network Distillation (RND) and

Reviewer 02·Rating 6·Confidence 2

Strengths

+ The paper is crafted very well, with high-quality writing, figures, and appendices
+ The topic of uncertainty estimation for RL (including for critics/value functions) remains relevant, with no widely accepted approaches. Existing techniques like deep ensembles or MC Dropout are available, but are not widespread (e.g., due to the additional computational demands or potential training instability), and MC Dropout in particular has been shown to provide mixed results
+ The motivation and prelimi

Weaknesses

- The empirical analysis is somewhat limited (although it is an illustrative use case). I would love to see some additional experiments beyond this scenario. As the method is generalizable, I feel it should be possible to apply it to common benchmark environments (similar to the original RND paper)
- The lack of any implementation limits impact, and also makes it difficult to fully assess reproducibility.

Reviewer 03·Rating 4·Confidence 3

Strengths

The paper aims to address an important issue in RL, and the theoretical analysis is strong.

Weaknesses

The central concern with the paper is the numerical study. If the true values are known for the simulated environment, the experiments should include coverage rates (and related calibration metrics). Reporting only the best policy’s returns with confidence intervals does not fully characterize estimator performance or uncertainty.
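One way to read the coverage rate requested here is the fraction of states whose true value falls inside the predicted interval. The sketch below is illustrative only: the function name `coverage_rate`, the Gaussian-style interval of `z` standard deviations, and the toy numbers are all assumptions, not taken from the paper or the review.

```python
import numpy as np

def coverage_rate(true_values, predicted_values, predicted_std, z=1.96):
    """Fraction of true values inside predicted_values +/- z * predicted_std.

    z=1.96 corresponds to a nominal 95% interval under a Gaussian assumption.
    """
    true_values = np.asarray(true_values, dtype=float)
    predicted_values = np.asarray(predicted_values, dtype=float)
    predicted_std = np.asarray(predicted_std, dtype=float)
    inside = np.abs(true_values - predicted_values) <= z * predicted_std
    return float(inside.mean())

# Toy numbers for illustration only.
true_v = np.array([1.0, 2.0, 3.0, 4.0])
pred_v = np.array([1.1, 2.5, 2.9, 6.0])
pred_std = np.array([0.1, 0.2, 0.1, 0.5])
rate = coverage_rate(true_v, pred_v, pred_std)
print(rate)  # 0.5: two of the four true values fall inside their intervals
```

A well-calibrated estimator would give an empirical rate close to the nominal level; a rate far below it indicates overconfident uncertainty estimates.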

Taxonomy

Topics: Simulation Techniques and Applications