A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning

Qishi Zhan, Minxuan Hu, Liang He, Guansu Wang, and Jiaxin Liu

arXiv:2604.23114 · cs.LG · April 28, 2026

TL;DR

This paper shows that single-seed evaluations of Bayesian deep learning methods can be unreliable in limited-data settings, because the seed-to-seed variance of metrics such as CRPS can peak at intermediate training sizes, and it proposes more reliable evaluation practices.

Contribution

The paper demonstrates the limitations of single-seed benchmarks in Bayesian deep learning, analyzes how CRPS variance changes with training size across methods, and proposes improved evaluation practices.

Findings

01

Variance peaks can cause large estimation errors at intermediate training sizes.

02

CRPS variance correlates strongly with single-seed estimation error.

03

Replacing heteroscedastic objectives reduces variance irregularities.
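
To make the link between seed-to-seed variance and single-seed error concrete, here is a minimal sketch of how the two quantities could be computed from per-seed mean CRPS values at one training size. The function names are hypothetical, and the definition of relative RMSE used here (root-mean-square deviation of single-seed values from the across-seed mean, divided by that mean) is an assumption; the paper's exact definition is not given in this summary.

```python
import numpy as np

def seed_variance(scores):
    """Unbiased sample variance of a metric across independent seeds."""
    scores = np.asarray(scores, dtype=float)
    return float(np.var(scores, ddof=1))

def single_seed_relative_rmse(scores):
    """Relative RMSE of a single-seed estimate (assumed definition).

    Treats the across-seed mean as the reference value and measures how far
    a randomly chosen single-seed value typically falls from it, as a
    fraction of that reference.
    """
    scores = np.asarray(scores, dtype=float)
    reference = scores.mean()
    rmse = np.sqrt(np.mean((scores - reference) ** 2))
    return float(rmse / abs(reference))

# Made-up per-seed CRPS values, for illustration only (not from the paper).
example_scores = [1.0, 3.0]
print(seed_variance(example_scores))              # 2.0
print(single_seed_relative_rmse(example_scores))  # 0.5
```

Under this definition the relative RMSE is the population standard deviation divided by the mean, so it grows with the square root of the variance; that is consistent with, but does not by itself establish, the correlation reported in finding 02.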

Abstract

In limited-data settings, a single endpoint mean of an evaluation metric such as the Continuous Ranked Probability Score (CRPS) is itself a random variable, yet it is routinely reported as if it were a stable property of the method. We study when this practice fails. Using 50 independent repetitions across six regression datasets, we show that CRPS variance trajectories differ substantially across methods and are not always well described by a smooth power-law decay. Methods with a learned heteroscedastic variance head, namely MAP and Deep Ensembles, can develop pronounced, reproducible variance peaks at intermediate training sizes on real datasets, whereas MC Dropout and Bayes by Backprop typically show smooth variance contraction. These peaks have direct practical consequences: at the variance peak on Seoul Bike, the relative RMSE of a single-seed MAP estimate reaches 93.6%, and the…
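
For readers unfamiliar with CRPS, the sketch below evaluates it for a Gaussian predictive distribution using the standard closed form CRPS(N(mu, sigma^2), y) = sigma * [z(2Φ(z) − 1) + 2φ(z) − 1/√π], with z = (y − mu)/sigma. It assumes the model outputs a mean and a standard deviation, as a heteroscedastic variance head would; whether the paper computes CRPS this way is not stated in the abstract, and the function name is hypothetical.

```python
import math

def gaussian_crps(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at observation y.

    Lower is better. sigma must be positive.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

# A perfectly centered forecast still pays a penalty proportional to its spread.
print(gaussian_crps(mu=0.0, sigma=1.0, y=0.0))
print(gaussian_crps(mu=0.0, sigma=2.0, y=0.0))
```

Averaging this score over a test set gives one endpoint mean per training run; repeating the run with different seeds yields the distribution whose variance the abstract discusses.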
