A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning
Qishi Zhan, Minxuan Hu, Liang He, Guansu Wang, and Jiaxin Liu

TL;DR
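Single-seed evaluations of Bayesian deep learning methods can be unreliable: across repeated runs, the variance of the reported metric (CRPS) can peak at intermediate training sizes instead of decaying smoothly. The paper characterizes these variance trajectories and proposes improved evaluation practices.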
Contribution
It demonstrates the limitations of single-seed benchmarks in Bayesian deep learning, analyzes variance trajectories, and proposes improved evaluation practices.
Findings
Variance peaks can cause large estimation errors at intermediate training sizes.
CRPS variance correlates strongly with single-seed estimation error.
Replacing heteroscedastic objectives reduces variance irregularities.
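To make "single-seed estimation error" concrete, here is a minimal sketch of one way to compute a relative RMSE from per-seed results. It assumes the reference value is the mean over all seeds and that the error is normalized by that mean. The paper's exact definition is not given in this summary, so the function name `single_seed_relative_rmse` and the normalization are illustrative assumptions.

```python
import math
import statistics


def single_seed_relative_rmse(per_seed_means):
    """Relative RMSE of a single-seed estimate against the multi-seed mean.

    ASSUMPTION: the multi-seed mean is treated as the reference value and
    the RMSE is divided by its absolute value. This may differ from the
    definition used in the paper.
    """
    values = list(per_seed_means)
    reference = statistics.fmean(values)
    if reference == 0:
        raise ValueError("reference mean is zero; relative error is undefined")
    # Mean squared deviation of each single-seed estimate from the reference.
    mse = statistics.fmean([(v - reference) ** 2 for v in values])
    return math.sqrt(mse) / abs(reference)
```

Under this definition, a value near 1.0 means a typical single-seed result deviates from the multi-seed mean by about as much as the mean itself.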
Abstract
In limited-data settings, a single endpoint mean of an evaluation metric such as the Continuous Ranked Probability Score (CRPS) is itself a random variable, yet it is routinely reported as if it were a stable property of the method. We study when this practice fails. Using 50 independent repetitions across six regression datasets, we show that CRPS variance trajectories differ substantially across methods and are not always well described by a smooth power-law decay. Methods with a learned heteroscedastic variance head, namely MAP and Deep Ensembles, can develop pronounced, reproducible variance peaks at intermediate training sizes on real datasets, whereas MC Dropout and Bayes by Backprop typically show smooth variance contraction. These peaks have direct practical consequences: at the variance peak on Seoul Bike, the relative RMSE of a single-seed MAP estimate reaches 93.6%, and the…
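For readers unfamiliar with CRPS, the sketch below evaluates it for a single observation under a Gaussian predictive distribution, using the standard closed form. The Gaussian form is an assumption made here for illustration, since a heteroscedastic head that outputs a mean and a variance fits it naturally. This summary does not state which predictive distributions the paper scores, and the function name `gaussian_crps` is hypothetical.

```python
import math


def gaussian_crps(mu, sigma, y):
    """Closed-form CRPS of the forecast N(mu, sigma^2) at observation y.

    ASSUMPTION: the predictive distribution is Gaussian; this is an
    illustration and not necessarily the paper's evaluation code.
    Lower values indicate a better probabilistic forecast.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
```

Averaging this score over a test set gives one endpoint mean per seed, and the spread of those means across seeds is the quantity whose variance the abstract discusses.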
