Bayesian Optimization via Continual Variational Last Layer Training
Paul Brunzema, Mikkel Jordahn, John Willes, Sebastian Trimpe, Jasper Snoek, James Harrison

TL;DR
This paper introduces an online training method for variational Bayesian last layer (VBLL) neural networks. The resulting surrogates outperform Gaussian processes and other Bayesian neural networks on tasks with complex input correlations, offering a scalable alternative for Bayesian optimization.
Contribution
The paper presents a new online training algorithm for variational Bayesian last layer models, connecting them to Gaussian process conditioning, and demonstrating improved performance on complex tasks.
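To make the link between online last-layer updates and Gaussian process conditioning concrete, here is a minimal sketch of the exact conjugate case: Bayesian linear regression on fixed features, updated one observation at a time in natural parameters (precision matrix and precision-weighted mean). This is not the paper's variational training procedure. The function names, the fixed feature vectors standing in for a network's last hidden layer, and the `prior_precision` and `noise_var` values are all assumptions made for illustration.

```python
import numpy as np

def init_natural(dim, prior_precision=1.0):
    """Natural parameters of the prior N(0, I / prior_precision) over last-layer weights."""
    Lambda = prior_precision * np.eye(dim)  # precision matrix
    eta = np.zeros(dim)                     # precision-weighted mean
    return Lambda, eta

def update_natural(Lambda, eta, phi, y, noise_var=0.1):
    """Condition on one observation (phi, y); in natural form this is purely additive."""
    Lambda_new = Lambda + np.outer(phi, phi) / noise_var
    eta_new = eta + phi * y / noise_var
    return Lambda_new, eta_new

def predict(Lambda, eta, phi, noise_var=0.1):
    """Predictive mean and variance at feature vector phi."""
    cov = np.linalg.inv(Lambda)   # posterior covariance of the weights
    mean_w = cov @ eta            # posterior mean of the weights
    return phi @ mean_w, phi @ cov @ phi + noise_var

def batch_posterior(Phi, y, prior_precision=1.0, noise_var=0.1):
    """Posterior from all data at once, for comparison with the recursive update."""
    Lambda = prior_precision * np.eye(Phi.shape[1]) + Phi.T @ Phi / noise_var
    eta = Phi.T @ y / noise_var
    return Lambda, eta

# Hypothetical data: 20 observations of 4 fixed features (assumed, not from the paper).
rng = np.random.default_rng(0)
noise_var = 0.1
Phi = rng.normal(size=(20, 4))
y = Phi @ np.array([1.0, -2.0, 0.5, 0.0]) + np.sqrt(noise_var) * rng.normal(size=20)

# Recursive (online) conditioning, one point at a time.
Lambda, eta = init_natural(4)
for phi_t, y_t in zip(Phi, y):
    Lambda, eta = update_natural(Lambda, eta, phi_t, y_t, noise_var)

# Batch conditioning on the same data.
Lambda_b, eta_b = batch_posterior(Phi, y, noise_var=noise_var)
pred_mean, pred_var = predict(Lambda, eta, Phi[0], noise_var)
```

In this conjugate setting the recursive and batch posteriors coincide, which is why a new observation can be absorbed without revisiting old data. Retraining the feature extractor, which the paper's method also has to handle, is outside the scope of this sketch.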
Findings
VBLL networks outperform GPs on complex correlation tasks.
VBLLs match GPs on benchmark tasks.
Proposed method is efficient and scalable for online training.
Abstract
Gaussian Processes (GPs) are widely seen as the state-of-the-art surrogate models for Bayesian optimization (BO) due to their ability to model uncertainty and their performance on tasks where correlations are easily captured (such as those defined by Euclidean metrics) and their ability to be efficiently updated online. However, the performance of GPs depends on the choice of kernel, and kernel selection for complex correlation structures is often difficult or must be made bespoke. While Bayesian neural networks (BNNs) are a promising direction for higher capacity surrogate models, they have so far seen limited use due to poor performance on some problem types. In this paper, we propose an approach which shows competitive performance on many problem types, including some that BNNs typically struggle with. We build on variational Bayesian last layers (VBLLs), and connect training of…
Peer Reviews
Decision: ICLR 2025 Spotlight
- The method is well-explained and is theoretically justified, and there are additional modifications which can be made to increase the efficiency such as feature re-use and sparse full model retraining. This flexibility enables practitioners to balance the tradeoff between model performance and computational cost.
- The authors use a diverse setting of test objectives, specifically demonstrating performance on instances with high-dimensionality and non-stationarity.
- VBLL appears to be robust
- Although it appears that one of the primary motivations behind this work is the increased efficiency compared to other BNN surrogates, there is no measure of runtime or computational cost within the paper. It would be helpful to understand how these methods perform as a function of computational budget. This could also help clarify the difference in performances between VBLL and VBLL CL.
- There is also currently no demonstration of why this approximation would be preferred over the using the
The paper combines two well studied ideas in an elegant way and the presentation is (relatively) easy to follow (though I would have wished a bit more emphasis on the natural parameterization of the Normal distributions as this is key to the computational efficiency). The empirical studies are extensive and well discussed.
One aspect that is disregarded by the paper is how to choose the network architecture for all but the last layer; I have no idea how sensitive the quality of the proposed approach is to this. In essence, the complexity of choosing a kernel function for GPs has been shifted to the network architecture of the underlying neural network. This is not discussed in sufficient detail. Also, the difference to the Laplace approximation of the last layer is discussed only at the end; I would have expected this
The paper is well-written and a pleasure to read. The problem statement is clear from the outset, and the connections to related work are extensive. I also appreciated the paper’s focus on practical aspects such as improving training efficiency via continual learning. Having the method implemented in BoTorch is also appealing to practitioners wanting to experiment using this method in real-world settings. The experiments demonstrate that VBLL performs well in the targeted settings having c
1. Deep kernel learning was the first method to come to mind when reading the motivation for this work. While I appreciated its inclusion in the experimental section, I would have liked more discussion in the earlier sections on why DKL might be less ideal than VBLL. To my understanding, DKL’s computational complexity, especially in high-dimensional settings, might be a key differentiator, but additional detail on this would help clarify VBLL’s practical advantages right from the outset.
2. Whil
Taxonomy
Topics: Target Tracking and Data Fusion in Sensor Networks · Gaussian Processes and Bayesian Inference
Methods: Greedy Policy Search
