Entropy-MCMC: Sampling from Flat Basins with Ease
Bolian Li, Ruqi Zhang

TL;DR
Entropy-MCMC is a sampling method that biases MCMC toward flat, well-generalizing modes of the deep neural network posterior, improving performance on Bayesian deep learning tasks.
Contribution
The paper proposes a new flat-basin biased MCMC sampling technique using an auxiliary guiding variable, with proven convergence and faster mixing than existing methods.
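To make the role of the guiding variable concrete, the following is a sketch of one joint density with the properties described here. The exact form is an assumption (the summary above does not spell it out): $f(\theta)$ denotes the negative log-posterior energy, $\theta_a$ the guiding variable, and $\eta > 0$ a hypothetical coupling variance.

```latex
% Assumed joint density over the model parameter \theta and the guiding variable \theta_a
p(\theta, \theta_a \mid \mathcal{D}) \;\propto\;
  \exp\!\Big( -f(\theta) - \tfrac{1}{2\eta}\,\lVert \theta - \theta_a \rVert^2 \Big)

% Marginal of \theta: the Gaussian integral over \theta_a is a constant in \theta,
% so the original posterior is recovered without bias
\int p(\theta, \theta_a \mid \mathcal{D})\, d\theta_a \;\propto\; \exp\!\big(-f(\theta)\big)

% Marginal of \theta_a: the posterior convolved with a Gaussian of variance \eta,
% i.e. a smoothed posterior in which sharp modes are flattened
\int p(\theta, \theta_a \mid \mathcal{D})\, d\theta \;\propto\;
  \int \exp\!\Big( -f(\theta) - \tfrac{1}{2\eta}\,\lVert \theta - \theta_a \rVert^2 \Big)\, d\theta
```

Under this assumed form, a single sampler over the pair $(\theta, \theta_a)$ needs no inner loop: the only extra cost over sampling $\theta$ alone is the quadratic coupling term.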
Findings
Successfully samples from flat posterior basins
Outperforms baselines on classification, calibration, and out-of-distribution (OOD) detection
Converges faster than existing flatness-aware methods
Abstract
Bayesian deep learning counts on the quality of posterior distribution estimation. However, the posterior of deep neural networks is highly multi-modal in nature, with local modes exhibiting varying generalization performance. Given a practical budget, targeting the original posterior can lead to suboptimal performance, as some samples may become trapped in "bad" modes and suffer from overfitting. Leveraging the observation that "good" modes with low generalization error often reside in flat basins of the energy landscape, we propose to bias sampling on the posterior toward these flat regions. Specifically, we introduce an auxiliary guiding variable, the stationary distribution of which resembles a smoothed posterior free from sharp modes, to lead the MCMC sampler to flat basins. By integrating this guiding variable with the model parameter, we create a simple joint distribution that…
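The idea of sampling the model parameter jointly with a guiding variable can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a joint energy of the form U(theta, theta_a) = f(theta) + (theta - theta_a)^2 / (2 * eta), uses plain full-gradient Langevin dynamics on a one-dimensional target, and the names `joint_grad`, `entropy_langevin`, and `eta` are hypothetical.

```python
import numpy as np


def joint_grad(theta, theta_a, grad_f, eta):
    """Gradients of the ASSUMED joint energy
    U(theta, theta_a) = f(theta) + (theta - theta_a)**2 / (2 * eta)
    with respect to theta and theta_a."""
    coupling = (theta - theta_a) / eta
    return grad_f(theta) + coupling, -coupling


def entropy_langevin(grad_f, eta, step, n_steps, n_chains, rng):
    """Run independent unadjusted Langevin chains on the joint (theta, theta_a).

    Returns the final state of every chain. Both variables are updated in
    the same step, so there is no inner loop.
    """
    theta = np.zeros(n_chains)
    theta_a = np.zeros(n_chains)
    noise_scale = np.sqrt(2.0 * step)
    for _ in range(n_steps):
        g_theta, g_a = joint_grad(theta, theta_a, grad_f, eta)
        theta = theta - step * g_theta + noise_scale * rng.standard_normal(n_chains)
        theta_a = theta_a - step * g_a + noise_scale * rng.standard_normal(n_chains)
    return theta, theta_a


if __name__ == "__main__":
    # Toy target: standard normal posterior, f(theta) = theta**2 / 2.
    rng = np.random.default_rng(0)
    theta, theta_a = entropy_langevin(
        grad_f=lambda t: t, eta=0.5, step=0.02, n_steps=2000, n_chains=4000, rng=rng
    )
    print("var(theta)   =", theta.var())    # should be close to 1
    print("var(theta_a) =", theta_a.var())  # should be close to 1 + eta
```

For this Gaussian toy target the joint distribution is itself Gaussian, so the behavior can be checked analytically: theta keeps the variance of the original posterior, while theta_a has that variance plus eta, which is what a smoothed copy of the posterior should look like. On a real network the gradient of f would be replaced by a minibatch estimate, which this sketch does not attempt.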
Peer Reviews
Decision: ICLR 2024 poster
I think the idea of the augmented model is interesting and the paper is a nice read. These considerations are novel to the best of my knowledge. The manuscript is mostly very clear. The proposed method achieves good empirical performance.
I think the authors should clearly specify their model (prior distributions and likelihood) and only after that move to the inference part, to improve the clarity of the paper. I understand that when priors are uniform, the RHS of Eq. 4 effectively defines the likelihood of the augmented model the authors want to consider. The notation using $f(\theta)$ is confusing (e.g. because of no dependence on data) and should be replaced by substituting the definition above Eq. 3. No line numbers in t
* The paper is easy to understand and well written.
* The method resolves a key computational limitation in the local entropy approach from Chaudhari et al. 2019 and allows local entropy optimization with no inner loop, and essentially the same computational cost as SGD. This could make local entropy optimization much more appealing to practitioners than current methods.
* Synthetic dataset experiments and measurement of flatness metrics corroborate the claims of the method's ability to focus it
* The theoretical results focus on a very restricted case of strong convexity. Although analysis of this situation provides interesting context for the relative abilities of the proposed method and existing methods, nothing can be firmly concluded in realistic settings.
* The SGD baselines for the classification experiments are somewhat weak. It would be interesting to see if the proposed method can push the performance of models with state-of-the-art scores, or at least much closer to state of
- The paper is well organized, well written, and presents some advances in the field.
- The two-replica framework, although very similar to (actually a specific instance, with y=1 replicas, of) the Robust Ensemble (RE) introduced in Baldassi et al. PNAS '16, has the advantage over the generic RE of preserving an unbiased marginal measure, while in the RE with y>1 replicas the resulting marginals are tilted.
- There is good experimental coverage showing consistent (although small) improvements
- There is a lack of novelty with respect to Baldassi et al. PNAS '16, only partly justified by the focus on the Bayesian setting.
- All experiments seem to be performed using a temperature of T=1e-4, instead of the T=1 of the purely Bayesian setting. This makes the Entropy-MCMC framework even more similar to the optimization setting of Baldassi et al. PNAS '16 and Pittorino et al. ICLR '21. Since Table 4 shows only minimal performance decrease when setting the temperature to 1, I suggest to
Taxonomy
Topics: Hydrocarbon exploration and reservoir analysis · Seismic Imaging and Inversion Techniques
