Beyond Deep Ensembles: A Large-Scale Evaluation of Bayesian Deep Learning under Distribution Shift
Florian Seligmann, Philipp Becker, Michael Volpp, Gerhard Neumann

TL;DR
This paper systematically evaluates Bayesian deep learning methods on large-scale, real-world datasets with distribution shifts, focusing on calibration, generalization, and the effectiveness of ensembling and fine-tuning large models.
Contribution
It provides the first large-scale, systematic comparison of BDL methods on diverse tasks, including fine-tuning large pre-trained models and extending ensembles to multiple modes.
Findings
Ensembling improves generalization and calibration significantly.
When fine-tuning large pre-trained models, variational inference methods outperform the other methods in accuracy.
In the same fine-tuning setting, SWAG achieves the best calibration among the approximate inference algorithms evaluated.
Abstract
Bayesian deep learning (BDL) is a promising approach to achieve well-calibrated predictions on distribution-shifted data. Nevertheless, there exists no large-scale survey that evaluates recent SOTA methods on diverse, realistic, and challenging benchmark tasks in a systematic manner. To provide a clear picture of the current state of BDL research, we evaluate modern BDL algorithms on real-world datasets from the WILDS collection containing challenging classification and regression tasks, with a focus on generalization capability and calibration under distribution shift. We compare the algorithms on a wide range of large, convolutional and transformer-based neural network architectures. In particular, we investigate a signed version of the expected calibration error that reveals whether the methods are over- or under-confident, providing further insight into the behavior of the methods.…
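The signed calibration error mentioned in the abstract can be illustrated with a small sketch. The code below is not taken from the paper: the function name `calibration_errors`, the equal-width binning, the default of 10 bins, and the sign convention (mean confidence minus accuracy, so positive means over-confident) are all assumptions made for illustration. It computes the usual expected calibration error alongside a signed variant that drops the absolute value.

```python
def calibration_errors(confidences, correct, num_bins=10):
    """Return (ece, signed_ece) for a list of predictions.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     truthy if the prediction was right, falsy otherwise.
    Assumed sign convention: positive signed_ece = over-confident.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * num_bins  # summed confidence per bin
    hits = [0] * num_bins        # number of correct predictions per bin
    for c, y in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        b = min(int(c * num_bins), num_bins - 1)
        conf_sum[b] += c
        hits[b] += 1 if y else 0

    # Per-bin gap, already weighted by bin size:
    # (|B| / n) * (mean confidence - accuracy) = (conf_sum - hits) / n.
    gaps = [(conf_sum[b] - hits[b]) / n for b in range(num_bins)]

    ece = sum(abs(g) for g in gaps)  # magnitude of miscalibration
    signed_ece = sum(gaps)           # direction of miscalibration
    return ece, signed_ece


# Hypothetical toy data: 90% confident but only half correct.
ece, signed_ece = calibration_errors([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
print(ece, signed_ece)
```

Under this convention, a positive signed value indicates over-confidence and a negative one under-confidence. The unsigned value reports only the size of the gap, which is why the two are worth reading together.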
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning
