Variational Bayesian Last Layers
James Harrison, John Willes, Jasper Snoek

TL;DR
This paper presents a deterministic variational approach for Bayesian last layer neural networks that enhances uncertainty estimation and can be integrated efficiently into standard architectures, improving accuracy and calibration.
Contribution
It introduces a novel variational Bayesian last layer (VBLL) method that is computationally efficient and improves uncertainty estimation in neural networks.
Findings
VBLL improves predictive accuracy and calibration.
VBLL enhances out-of-distribution detection.
The method adds little computational cost to standard models: training and evaluation are only quadratic in the last-layer width.
Abstract
We introduce a deterministic variational formulation for training Bayesian last layer neural networks. This yields a sampling-free, single-pass model and loss that effectively improves uncertainty estimation. Our variational Bayesian last layer (VBLL) can be trained and evaluated with only quadratic complexity in last layer width, and is thus (nearly) computationally free to add to standard architectures. We experimentally investigate VBLLs, and show that they improve predictive accuracy, calibration, and out-of-distribution detection over baselines across both regression and classification. Finally, we investigate combining VBLL layers with variational Bayesian feature learning, yielding a lower-variance collapsed variational inference method for Bayesian neural networks.
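To make "sampling-free" and "quadratic in last layer width" concrete, here is a minimal NumPy sketch of a deterministic variational objective for a Bayesian last layer in the scalar-regression case. It is an illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian posterior q(w) = N(w_mean, w_cov) over the last-layer weights, a fixed noise variance, an isotropic Gaussian prior, and features `phi` that have already been computed by the network. All function and parameter names are hypothetical. The sketch relies on the standard identity E_q[(y - w^T phi)^2] = (y - w_mean^T phi)^2 + phi^T w_cov phi, which is what removes the need for sampling.

```python
import numpy as np


def gaussian_logpdf(y, mean, var):
    """Log density of a univariate Gaussian, evaluated elementwise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)


def expected_loglik(phi, y, w_mean, w_cov, noise_var):
    """Closed-form E_q[log N(y | w^T phi, noise_var)] with q(w) = N(w_mean, w_cov).

    phi: (T, d) features, y: (T,) targets, w_mean: (d,), w_cov: (d, d).
    """
    pred_mean = phi @ w_mean
    # phi_t^T S phi_t for every row t: O(d^2) work per example, no sampling.
    quad = np.einsum("td,de,te->t", phi, w_cov, phi)
    return gaussian_logpdf(y, pred_mean, noise_var) - 0.5 * quad / noise_var


def kl_to_isotropic_prior(w_mean, w_cov, prior_var):
    """KL( N(w_mean, w_cov) || N(0, prior_var * I) ); w_cov must be positive definite."""
    d = w_mean.shape[0]
    _, logdet = np.linalg.slogdet(w_cov)
    return 0.5 * (
        np.trace(w_cov) / prior_var
        + w_mean @ w_mean / prior_var
        - d
        + d * np.log(prior_var)
        - logdet
    )


def last_layer_objective(phi, y, w_mean, w_cov, noise_var, prior_var=1.0):
    """Per-example lower bound (to be maximized); the 1/T KL scaling is an assumption."""
    T = phi.shape[0]
    fit = expected_loglik(phi, y, w_mean, w_cov, noise_var).mean()
    return fit - kl_to_isotropic_prior(w_mean, w_cov, prior_var) / T


# Hypothetical usage with random features standing in for a network's penultimate layer.
rng = np.random.default_rng(0)
phi = rng.normal(size=(32, 8))
y = rng.normal(size=32)
bound = last_layer_objective(phi, y, np.zeros(8), 0.1 * np.eye(8), noise_var=0.5)
```

The only term that depends on the posterior covariance per example is the quadratic form `phi^T w_cov phi`, which is where the quadratic cost in feature width comes from in this sketch. In practice the covariance would typically be parameterized (for example through a Cholesky factor) so that it stays positive definite during gradient-based training; that choice is left out here.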
Peer Reviews
Decision: ICLR 2024 spotlight
1. The paper is well-written and easy to follow in most parts. Moreover, the work is well-motivated and I enjoyed that the authors brought back old ideas to the BDL community, e.g., using the discriminant analysis as a likelihood model. 2. I believe the exposition of the method is well done in most places, though slightly dense here and there, and helped in understanding the general idea of the proposed method. Moreover, I believe that the method is correct and an interesting contribution to the
Overall: My main concern with the paper is the weak empirical evaluation and limited novelty of the work, that is, it seems it is essentially an application of known techniques to the special case of last-layer posteriors. Comments: 1. Section 2.4 lists various related works, which I believe the author claims to optimize the log marginal via gradient descent. I have not checked every citation, but it appears to me that this statement is false for at least a subset of the cited papers. It might
BLL networks are an interesting approach to solve the scalability problem Bayesian neural networks tend to suffer from. The paper introduces another variation to this family of approaches that is relatively straightforward, easy to understand, and implement. The method is properly evaluated as the number of experimental setups is reasonably extensive both with respect to architectures and experimental tasks.
Straightforward contributions can be seen both as a strength and as a weakness depending on the situation. They are a strength if they are an easy solution to a complex problem that might not improve upon current approaches in all situations, but most. They are a weakness if they do not provide a clear theoretical benefit above current approaches and also come without clear performance improvements. For me, the results point to the latter case as they are rather mixed despite some strong word
This is an excellent paper and a significant contribution - well done! The authors make clear how they build on the existing literature in Bayesian deep learning to create a novel advance that is practical and easy to implement. This is significant and should enable more work to push the frontier of the "best of both worlds", with neural networks serving as function approximators and Bayesian methods enabling sample-efficiency and quantification of uncertainty that is required for practical depl
Visualizations of how tight or loose the bounds in the main text are could help build more intuition; comparisons in terms of speed or efficiency to variational inference algorithms that do require sampling (such as Monte Carlo objectives like VIMCO) could also help guide practitioners in making the correct trade-off between the FLOPs of compute available and the required accuracy of posterior approximation and uncertainty quantification.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Gaussian Processes and Bayesian Inference
Methods: Variational Inference
