TL;DR
This paper investigates whether microcanonical Langevin dynamics can effectively utilize mini-batch gradient noise, addressing scalability issues in Bayesian deep learning inference methods.
Contribution
It provides the first systematic theoretical analysis of stochastic-gradient microcanonical dynamics and introduces novel techniques to improve scalability and robustness.
Findings
Identifies bias due to anisotropic gradient noise and numerical instabilities in high dimensions.
Proposes a gradient noise preconditioning scheme that reduces bias.
Develops an adaptive tuner for step size and numerical stability, enabling scalable sampling.
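To make the first finding concrete, here is a minimal sketch of what anisotropic mini-batch gradient noise looks like. It is not taken from the paper: the toy linear-regression data, the function names (`per_example_grad`, `minibatch_grad`, `grad_noise_variance`), and all constants are assumptions chosen for illustration. It estimates the per-coordinate variance of the mini-batch gradient and shows that the variance differs strongly between coordinates when the input features have different scales.

```python
import random
import statistics


def per_example_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one example."""
    resid = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [resid * xi for xi in x]


def minibatch_grad(w, data, batch_size, rng):
    """Average gradient over a mini-batch drawn without replacement."""
    batch = rng.sample(data, batch_size)
    grads = [per_example_grad(w, x, y) for x, y in batch]
    return [sum(g[d] for g in grads) / batch_size for d in range(len(w))]


def grad_noise_variance(w, data, batch_size, n_draws, rng):
    """Empirical per-coordinate variance of the mini-batch gradient at w."""
    draws = [minibatch_grad(w, data, batch_size, rng) for _ in range(n_draws)]
    return [statistics.pvariance([g[d] for g in draws]) for d in range(len(w))]


# Hypothetical toy data: the second feature has 10x the scale of the first.
rng = random.Random(0)
data = []
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 10.0)]
    y = x[0] + 0.1 * x[1] + rng.gauss(0.0, 0.1)
    data.append((x, y))

noise_var = grad_noise_variance([0.0, 0.0], data, batch_size=32, n_draws=500, rng=rng)
print("per-coordinate gradient noise variance:", noise_var)
```

In this toy setting the noise variance along the second coordinate is far larger than along the first, so the gradient noise is not isotropic. The review excerpts below note that the exactness result for MCLMC is stated for injected isotropic noise, which is why a mismatch of this kind is a plausible source of bias.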
Abstract
Scaling inference methods such as Markov chain Monte Carlo to high-dimensional models remains a central challenge in Bayesian deep learning. A promising recent proposal, microcanonical Langevin Monte Carlo, has shown state-of-the-art performance across a wide range of problems. However, its reliance on full-dataset gradients makes it prohibitively expensive for large-scale problems. This paper addresses a fundamental question: Can microcanonical dynamics effectively leverage mini-batch gradient noise? We provide the first systematic study of this problem, establishing a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics. We reveal two critical failure modes: a theoretically derived bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors. To tackle these issues, we propose a principled gradient noise…
Peer Reviews
Decision: Submitted to ICLR 2026
The techniques of the paper jointly resolve anisotropic noise bias and numerical instability, enabling scalable and robust Bayesian inference. The proposed SMILE and pSMILE methods show theoretical grounding and consistent empirical gains over SGHMC across Bayesian neural networks, image classification, and language modeling tasks.
1. Although the paper benchmarks against strong baselines such as scale-adapted SGHMC, it omits comparison with several strong adaptive SGMCMC methods, e.g., SGFS (Ahn et al., 2012), pSGLD (Li et al., 2016), MSGLD (Kim et al., 2020), cyclical SGMCMC (Zhang et al., 2020), MAMBA (Coullon et al., 2021), and PX-SGMCMC (Kim et al., 2025). In particular, cyclical SGMCMC is a strong baseline in this field. Including these would clarify whether the improvements stem from the microcanonical formulation itse
The paper addresses an important problem.
• The paper demonstrates that the naïve version of MCLMC, which uses stochastic gradients instead of full-batch gradients, fails to achieve good results. This variant is referred to as SMILE-naïve.
• To address this issue, the authors propose a gradient noise preconditioning scheme.
• Prior work by Robnik & Seljak (2024) has shown that the stationary distribution of MCLMC equals the target posterior p(θ|D) for any injected isotropic noise. In App
• The first three pages consist mainly of the introduction and discussion of related work. While this section is useful and well written, unfortunately, there isn’t enough space left in the main paper to adequately present the details of the new contributions.
• Perhaps because of the limited space in the main paper, I found the presentation a bit dense and at times difficult to follow. Some examples of missing details:
• Line 92: “theta(k,s), k \in [K], s \in [S]} from K independent chains”.
- Just as methods such as SGLD and pSGLD were developed from LMC, it seems natural and well-motivated that a similar line of research would emerge for MCLMC as well.
- Section 3 is easy to follow, and the overall procedure seems sound. The Gaussian assumption for gradient noise is quite common, and the diagonal preconditioning is also constructed in a typical moving-average fashion, as seen in many prior works. The small-scale experiments also provide adequate empirical support.
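The "typical moving-average fashion" mentioned above can be sketched as follows. This is a generic exponential-moving-average estimator of a diagonal gradient-noise variance, in the spirit of RMSProp-style preconditioners; it is not the paper's exact scheme, and the class name `DiagonalNoiseEMA`, the parameters `beta` and `eps`, and the synthetic gradients are all assumptions made for illustration.

```python
import random


class DiagonalNoiseEMA:
    """Tracks EMAs of the gradient and squared gradient, per coordinate."""

    def __init__(self, dim, beta=0.99, eps=1e-8):
        self.beta = beta
        self.eps = eps
        self.mean = [0.0] * dim      # EMA of g
        self.second = [0.0] * dim    # EMA of g^2
        self.steps = 0

    def update(self, grad):
        b = self.beta
        self.steps += 1
        self.mean = [b * m + (1 - b) * g for m, g in zip(self.mean, grad)]
        self.second = [b * s + (1 - b) * g * g for s, g in zip(self.second, grad)]

    def variance(self):
        """Bias-corrected diagonal variance estimate: E[g^2] - E[g]^2."""
        if self.steps == 0:
            raise ValueError("call update() at least once before variance()")
        c = 1.0 - self.beta ** self.steps  # corrects the zero initialisation
        return [
            max(s / c - (m / c) ** 2, 0.0) + self.eps
            for m, s in zip(self.mean, self.second)
        ]


# Synthetic stochastic gradients: fixed mean, noise std 1 and 5 per coordinate.
rng = random.Random(0)
est = DiagonalNoiseEMA(dim=2)
for _ in range(2000):
    est.update([2.0 + rng.gauss(0.0, 1.0), -1.0 + rng.gauss(0.0, 5.0)])

var = est.variance()
print("estimated diagonal noise variance:", var)
```

An estimate like `var` could then be used to rescale each coordinate so that the effective noise is closer to isotropic; how the paper actually applies its preconditioner is described in its Section 3 and is not reproduced here.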
- In essence, the preconditioning introduced in Section 3 aims to ensure that the stationary distribution matches the target posterior under anisotropic mini-batch noise. However, if we proceed as described in Section 4, it seems that the stationary distribution may no longer coincide with the target posterior; the guardrail-induced forced reversion could potentially disrupt proper posterior sampling.
- Table 3 includes Laplace and IVON baselines. Are those with K=1, S=8? If SGHMC, SMILE, and p
Taxonomy
Topics: Markov Chains and Monte Carlo Methods · Generative Adversarial Networks and Image Synthesis · Gaussian Processes and Bayesian Inference
