Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD
Ze Peng, Jian Zhang, Yisen Wang, Lei Qi, Yinghuan Shi, Yang Gao

TL;DR
This paper introduces a new information-theoretic generalization bound that leverages flatness in SGD, resulting in tighter bounds that better reflect the model's generalization performance, especially in deep neural networks.
Contribution
It derives a flatness-aware IT bound for SGD, improving the understanding of how flatness influences generalization and providing tighter, more accurate bounds.
Findings
The bound reflects that better flatness correlates with better generalization.
Experiments show the bound is numerically tighter and aligns with empirical results.
Applied to convex-Lipschitz-bounded (CLB) problems, the technique improves the bound for gradient descent from a non-vanishing Ω(1) to O(1/√n).
Abstract
Information-theoretic (IT) generalization bounds have been used to study the generalization of learning algorithms. These bounds are intrinsically data- and algorithm-dependent so that one can exploit the properties of data and algorithm to derive tighter bounds. However, we observe that although the flatness bias is crucial for SGD's generalization, these bounds fail to capture the improved generalization under better flatness and are also numerically loose. This is caused by the inadequate leverage of SGD's flatness bias in existing IT bounds. This paper derives a more flatness-leveraging IT bound for the flatness-favoring SGD. The bound indicates the learned models generalize better if the large-variance directions of the final weight covariance have small local curvatures in the loss landscape. Experiments on deep neural networks show our bound not only correctly reflects the better…
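The abstract's claim about large-variance directions and local curvature can be illustrated with a toy calculation. The sketch below is not the paper's bound: it assumes, purely for illustration, that the interaction between the weight covariance Σ and the local Hessian H can be summarized by the proxy tr(HΣ). The matrices and the names `hessian`, `cov_along_flat`, and `cov_along_sharp` are hypothetical.

```python
def trace_of_product(a, b):
    """Return tr(a @ b) for two square matrices given as nested lists."""
    n = len(a)
    return sum(a[i][k] * b[k][i] for i in range(n) for k in range(n))


# Hypothetical local Hessian: sharp along axis 0, flat along axis 1.
hessian = [[10.0, 0.0],
           [0.0, 0.1]]

# Two hypothetical weight covariances with the same total variance (trace 1.1).
cov_along_flat = [[0.1, 0.0],
                  [0.0, 1.0]]   # most variance lies in the flat direction
cov_along_sharp = [[1.0, 0.0],
                   [0.0, 0.1]]  # most variance lies in the sharp direction

# Illustrative proxy only (an assumption, not the paper's bound): tr(H @ Sigma).
flat_score = trace_of_product(hessian, cov_along_flat)
sharp_score = trace_of_product(hessian, cov_along_sharp)

print(f"variance along flat direction:  {flat_score:.2f}")
print(f"variance along sharp direction: {sharp_score:.2f}")
```

Under this proxy, the covariance whose large-variance direction coincides with the low-curvature axis scores lower, even though both covariances have the same total variance. This matches the qualitative statement in the abstract.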
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper introduces novel information-theoretic generalization bounds for SGD that improve upon previous results, particularly when SGD finds flat minima. 2. Constructing the auxiliary weight process twice, specifically through the omniscient trajectory, is a notable technical contribution in this field. 3. The results provide an explanation for the learnability of some SCO problems, where many previous information-theoretic bounds do not apply (e.g., non-vanishing bounds). 4. The paper
1. Some related works, whether cited or uncited in this paper, require further discussion in relation to these results. For example, [R2] also discusses limitations in [R1] and suggests that, if the SGD process can be well-approximated by SDEs (motivated by empirical observations), the auxiliary process could be the SDE approximation of SGD, with perturbed Gaussian noise depending on the training data and current state (i.e. depending on $S$ and $W_{t-1}$ at step $t$). Moreover, while not direct
- The contribution is novel and significant: the "omniscient" auxiliary trajectory is original and makes it possible to overcome two limitations of previous information-theoretic generalization bounds: 1) correlation between the bounds and observed generalization behavior under different batch sizes; 2) the $O(\frac{1}{\sqrt{n}})$ rate for Gradient Descent on convex-Lipschitz-Bounded problems. - The paper effectively motivates the problem and provides a useful high-level overview of the main ideas in the introduc
1) The bound requires an additional penalty term to be introduced. This additional term could worsen the bound in some applications if the benefits to the trajectory term are not high enough. Also, the proof involves two auxiliary trajectories (omniscient and SGLD-like) instead of one, complicating the proof and the bound. 2) In the introduction, it is postulated that the incorrect behavior of the previous bound with regard to varying batch sizes comes from the trajectory term not capturing the
Overall, this is a solid and well-written paper. The motivation and intuition behind the proposed methodology are well-explained. It provides a refinement for the information-theoretic generalization bound by exploiting the geometric properties of the objective function and by devising a more powerful auxiliary sequence. This allows it to better reflect empirical observations and recovers the minimax rate when applied to CLB-based bounds. In this regard, this paper takes a step further towards th
Although the paper is rather interesting, some concerns arise across different aspects. First, the major concern comes from the novelty. The paper should be regarded as a refinement over [Wang and Mao, 2021], and the generalization bound is neither new nor decisively better compared to the prior results. The methodology is also not completely original. For example, the use of anisotropic Gaussian perturbation also appears in previous works like [Neu et al., 2021], as mentioned by the authors themsel
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks
