Flat Posterior Does Matter For Bayesian Model Averaging
Sungjun Lim, Jeyoon Yeom, Sooyon Kim, Hoyoon Byun, Jinho Kang, Yohan Jung, Jiyoung Jung, Kyungwoo Song

TL;DR
This paper investigates the importance of posterior flatness in Bayesian neural networks for model averaging, revealing that encouraging flat posteriors enhances generalization and proposing a new method, FP-BMA, to achieve this.
Contribution
The paper introduces FP-BMA, a novel training objective that explicitly promotes flat posteriors in Bayesian neural networks, improving generalization performance.
Findings
Most approximate Bayesian inference methods do not produce flat posteriors.
Flat posteriors lead to better generalization in Bayesian model averaging.
FP-BMA effectively captures flat posteriors and enhances downstream task performance.
Abstract
Bayesian neural networks (BNNs) estimate the posterior distribution of model parameters and utilize posterior samples for Bayesian Model Averaging (BMA) in prediction. However, despite the crucial role of flatness in the loss landscape in improving the generalization of neural networks, its impact on BMA has been largely overlooked. In this work, we explore how posterior flatness influences BMA generalization and empirically demonstrate that (1) most approximate Bayesian inference methods fail to yield a flat posterior and (2) BMA predictions, without considering posterior flatness, are less effective at improving generalization. To address this, we propose Flat Posterior-aware Bayesian Model Averaging (FP-BMA), a novel training objective that explicitly encourages flat posteriors in a principled Bayesian manner. We also introduce a Flat Posterior-aware Bayesian Transfer Learning scheme…
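To make the BMA step in the abstract concrete, here is a minimal sketch of averaging predictions over posterior samples. It assumes a toy logistic-regression model and a diagonal Gaussian approximate posterior; the names `bma_predict`, `weight_samples`, `mean`, and `std` are illustrative and do not come from the paper, and this is not the FP-BMA objective itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bma_predict(x, weight_samples):
    """Average class-1 probabilities over posterior weight samples.

    x: (n, d) inputs; weight_samples: (S, d) draws from an approximate posterior.
    Returns an (n,) array of averaged probabilities.
    """
    probs = sigmoid(x @ weight_samples.T)  # shape (n, S): one column per sample
    return probs.mean(axis=1)              # Monte Carlo estimate of the BMA prediction

# Hypothetical diagonal Gaussian posterior over two weights (assumed values).
rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
std = np.array([0.5, 0.5])
weight_samples = mean + std * rng.standard_normal((200, 2))

x = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
p_bma = bma_predict(x, weight_samples)  # averaged over 200 posterior samples
p_point = sigmoid(x @ mean)             # single prediction at the posterior mean
```

Comparing `p_bma` with `p_point` shows how averaging over samples differs from predicting with a single weight vector; how much the two differ depends on the spread of the posterior.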
Peer Reviews
Decision · UAI 2025 Poster
- This paper targets the generalization of BNNs, which is an important problem.
- The paper provides empirical and theoretical analysis to support the need for flatness in BNNs.
- The overall goal of the paper is vague. As far as I understand, the proposed method increases the flatness of the variational parameter \theta, not the model parameter w. However, the literature shows that a flatter w leads to better generalization. There seems to be a gap. The meaning of "flatness in BNNs" is not very clear in the paper.
- Previous works have demonstrated the benefits of including flatness in BNNs, e.g. Möllenhoff & Khan, 2022, Nguyen et al., 2023, Li & Zhang, 2023. The additional ins
- The connection between the proposed objective and existing works is well-analyzed
- Well written and easy to follow
- In Bayesian deep learning, we ultimately have a distribution; here the authors use the averaged Hessian eigenvalues of different sampled weights as the measure of flatness. I'm not fully convinced this is a good measure of flatness over a distribution.
- The proposed objective is expensive to train.
1. I notice that this is a resubmitted paper. Compared with the last version, more analysis of the flatness of the loss landscape and the relation between flatness and generalization performance is included. I respect the authors' efforts in studying the geometry of the loss landscape.
2. The empirical analysis using Hessian eigenvalues clearly demonstrates why finding flat modes is important to the overall performance.
3. Comprehensive experiments are conducted to demonstrate the effectiveness of SA
1. The experiments on real-world datasets are limited to CIFAR10/100. I expect to see results on a large-scale dataset such as ImageNet to show the scalability of SA-BMA.
2. Figure 5 may lead to a misunderstanding that PTL and SA-BMA change the loss surface (in the first 2 figures).
The paper proposes an interesting combination of lines of work in deep learning, but in my opinion it does not evaluate whether this combination is useful. I see the plots in Figure 2 as a negative result in this respect, and think that, based on this, one could have written an interesting paper on flatness-seeking methods approximating Bayesian averages.
- I do not think there is a need for flatness-aware optimization in Bayesian models. That is because Bayesian models are building an average over all models with high likelihood (or posterior likelihood for informative priors). Taking this average will naturally lead to including a lot of models from flat optima, as they are simply wider and thus have more mass (in the prior). This in my opinion is underlined by the experiments in Figure 2b-c, where we can see that by simply using a larger Ensem
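The flatness measure discussed in the reviews above (Hessian eigenvalues averaged over sampled weights) can be sketched as follows. This is a rough illustration under stated assumptions: it uses a finite-difference Hessian on toy quadratic losses and the top eigenvalue as the statistic, since this summary does not state which eigenvalue statistic the paper uses. The helpers `numerical_hessian` and `mean_top_eigenvalue` and both losses are hypothetical.

```python
import numpy as np

def numerical_hessian(loss, w, eps=1e-3):
    """Central finite-difference Hessian of a scalar loss at weights w."""
    d = w.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d)
            e_i[i] = eps
            e_j = np.zeros(d)
            e_j[j] = eps
            H[i, j] = (loss(w + e_i + e_j) - loss(w + e_i - e_j)
                       - loss(w - e_i + e_j) + loss(w - e_i - e_j)) / (4 * eps**2)
    return 0.5 * (H + H.T)  # symmetrize to remove rounding asymmetry

def mean_top_eigenvalue(loss, weight_samples):
    """Average the largest Hessian eigenvalue over sampled weights.

    Larger values indicate sharper curvature around the samples.
    """
    tops = [np.linalg.eigvalsh(numerical_hessian(loss, w))[-1]  # eigvalsh sorts ascending
            for w in weight_samples]
    return float(np.mean(tops))

# Two toy quadratic losses with known curvature (assumed for illustration).
def sharp_loss(w):
    return 0.5 * (10.0 * w[0] ** 2 + 1.0 * w[1] ** 2)

def flat_loss(w):
    return 0.5 * (1.0 * w[0] ** 2 + 0.5 * w[1] ** 2)

rng = np.random.default_rng(0)
weight_samples = rng.standard_normal((20, 2))  # stand-in for posterior samples

sharp = mean_top_eigenvalue(sharp_loss, weight_samples)
flat = mean_top_eigenvalue(flat_loss, weight_samples)
```

For quadratic losses the Hessian is the same at every sample, so the average simply recovers the largest curvature; for a real network the value would vary across samples, which is the point of the reviewer's question about whether such an average characterizes flatness of a distribution.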
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Domain Adaptation and Few-Shot Learning
