Unstable Rankings in Bayesian Deep Learning Evaluation

Qishi Zhan, Minxuan Hu, Guansu Wang, Jiaxin Liu, and Liang He

arXiv:2604.23102 · cs.LG · April 28, 2026

TL;DR

This paper demonstrates that standard Bayesian deep learning evaluations are unreliable with limited data and proposes a hierarchical Bayesian framework to improve evaluation robustness across datasets.

Contribution

It introduces a Bayesian hierarchical evaluation method and a predictive detectability curve to assess the reliability of method comparisons in low-data regimes.
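
A detectability curve of this kind can be approximated with a classical power calculation. The sketch below is not the paper's predictive construction: it swaps in the textbook two-sample minimum detectable difference under a normal approximation. The function name `min_detectable_difference`, the spreads in `sd_by_size`, the `observed_gap`, and the choice of 10 repeats are all hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_difference(sd_a, sd_b, repeats, alpha=0.05, power=0.80):
    """Classical two-sample MDD for a difference in mean metric values.

    Assumption: each method's metric is roughly normal across `repeats`
    independent data realizations, with method-specific standard deviations.
    """
    z = NormalDist()
    # Two-sided test at level alpha, detected with probability `power`.
    critical = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)
    return critical * sqrt((sd_a**2 + sd_b**2) / repeats)


# Hypothetical spreads (sd_a, sd_b) at each training size; illustrative only.
sd_by_size = {50: (0.20, 0.12), 100: (0.14, 0.09), 500: (0.06, 0.04)}
observed_gap = 0.10  # hypothetical metric gap between the two methods

# One MDD value per training size gives a simple detectability curve.
curve = {
    n: min_detectable_difference(sa, sb, repeats=10)
    for n, (sa, sb) in sd_by_size.items()
}

for n, mdd in curve.items():
    status = "detectable" if observed_gap >= mdd else "not detectable"
    print(f"n={n}: MDD={mdd:.3f} -> gap of {observed_gap} is {status}")
```

Reading the curve against a fixed gap shows the idea: the same observed difference can fall below the detectable threshold at small training sizes and rise above it only once the metric's spread shrinks.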

Findings

1. Evaluation reliability improves with dataset-specific analysis.
2. Method superiority claims can be dataset-dependent at small sample sizes.
3. Uncertainty-aware evaluation is crucial in low-data settings.

Abstract

Standard evaluations of Bayesian deep learning methods assume that metric estimates are reliable, but we show this assumption fails under data scarcity. Method rankings are not only unreliable at small $n$, but also dataset-dependent in ways that point estimates cannot reveal: the same method comparison yields $P(\mathrm{MCD} \prec \mathrm{Ensemble}) = 1.000$ at $n = 50$ on one dataset and remains below $0.95$ even at $n = 500$ on another. Across the datasets we consider, no universal sample size threshold exists, which is precisely why dataset-specific posterior inference is necessary. To address this, we use a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables across data realizations, and we use a predictive Minimum Detectable Difference curve to assess whether an observed gap would be detectable at a given training size. Across…
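
To make the comparison probability quoted in the abstract concrete, here is a minimal sketch of how a quantity like P(A ≺ B) could be estimated from repeated evaluations. It is a deliberate simplification, not the paper's hierarchical model: it uses a plain normal approximation with method-specific variances and no pooling across datasets. Reading ≺ as "has the lower (better) mean metric" is an assumption, and the function name `prob_a_better` and all score values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def prob_a_better(scores_a, scores_b):
    """Approximate P(mean_a < mean_b) when lower scores are better.

    Assumption: the difference in means is approximately normal, with each
    method contributing its own sample variance (non-hierarchical stand-in).
    """
    diff = mean(scores_a) - mean(scores_b)
    se = sqrt(
        variance(scores_a) / len(scores_a)
        + variance(scores_b) / len(scores_b)
    )
    if se == 0.0:
        # Degenerate case: no spread at all, so the ordering is certain or tied.
        return 1.0 if diff < 0 else (0.5 if diff == 0 else 0.0)
    # Probability that the true difference (A minus B) lies below zero.
    return NormalDist(mu=diff, sigma=se).cdf(0.0)


# Hypothetical metric values over five data realizations (e.g. NLL).
clear_a = [0.50, 0.52, 0.49, 0.51, 0.50]
clear_b = [0.60, 0.61, 0.59, 0.62, 0.60]

noisy_a = [0.50, 0.70, 0.40, 0.65, 0.55]
noisy_b = [0.55, 0.60, 0.58, 0.62, 0.57]

p_clear = prob_a_better(clear_a, clear_b)
p_noisy = prob_a_better(noisy_a, noisy_b)
print(f"well-separated scores: P(A better) = {p_clear:.3f}")
print(f"overlapping scores:    P(A better) = {p_noisy:.3f}")
```

In both hypothetical cases method A has the lower mean, so a point estimate alone would report the same ranking; only the probability shows that the second comparison is far from settled.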
