TL;DR
IBNorm is a normalization technique inspired by the Information Bottleneck principle that enhances the informativeness of learned representations while maintaining training stability, outperforming traditional variance-centric methods.
Contribution
This paper introduces IBNorm, a novel normalization method based on the Information Bottleneck, which improves representation quality by balancing information preservation and suppression.
Findings
The authors prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods.
Empirical results show IBNorm outperforms BatchNorm, LayerNorm, and RMSNorm.
Mutual information analysis confirms better information bottleneck behavior.
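For reference, this summary does not define "IB value". One reasonable reading, assuming the paper follows the standard Information Bottleneck formulation, is the Lagrangian below. Here $X$ is the input, $Y$ the target, $Z$ the learned representation, and $\beta > 0$ a trade-off weight; the symbol names are chosen for this note and may differ from the paper's.

```latex
\mathrm{IB}_\beta(Z) \;=\; I(Z;Y) \;-\; \beta\, I(X;Z),
\qquad
I(A;B) \;=\; \mathbb{E}_{p(a,b)}\!\left[\log \frac{p(a,b)}{p(a)\,p(b)}\right]
```

Under this reading, a higher value means the representation keeps more information about the target relative to the $\beta$-weighted information it retains about the input.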
Abstract
Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric by enforcing zero mean and unit variance, stabilizing training without controlling how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations while retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across…
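To make the contrast in the abstract concrete, the following is a minimal sketch of the three variance-centric baselines in plain Python, followed by a bounded squashing step. The baselines follow their standard definitions (note that RMSNorm rescales by the root mean square without centering). The `bounded_compress` function is a hypothetical stand-in: this summary does not specify IBNorm's actual compression operations, so `tanh` is used only as an example of a bounded function, and the `EPS` and `scale` values are assumptions.

```python
import math

EPS = 1e-5  # small stability constant (assumed value, not from the paper)


def layer_norm(x):
    """Standardize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + EPS) for v in x]


def rms_norm(x):
    """Rescale one feature vector by its root mean square (no centering)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + EPS)
    return [v / rms for v in x]


def batch_norm(batch):
    """Standardize each feature (column) across the batch (rows)."""
    columns = list(zip(*batch))
    normed_columns = [layer_norm(list(col)) for col in columns]
    return [list(row) for row in zip(*normed_columns)]


def bounded_compress(x, scale=1.0):
    """Hypothetical stand-in for a bounded compression step.

    Squashes values into (-scale, scale) with tanh. This is NOT the
    paper's IBNorm operation, only an illustration of 'bounded'.
    """
    return [scale * math.tanh(v / scale) for v in x]


features = [0.5, -1.0, 2.0, 8.0]      # one outlier-heavy feature vector
standardized = layer_norm(features)   # variance-centric step
compressed = bounded_compress(standardized)  # bounded step applied on top
```

The learnable scale and shift parameters that normally follow these layers are omitted to keep the sketch short. The point of the last two lines is only that a bounded map keeps the ordering of the standardized values while limiting how far an outlier can dominate the output range.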
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper introduces an application of the information bottleneck to normalization. By integrating compression operations into the normalization process, the paper presents a significant departure from traditional methods that focus solely on statistical normalization (mean and variance). 2. The paper presents a clear theoretical foundation and solid empirical validation. The authors present rigorous proofs that IBNorm achieves a higher IB value compared to variance-centric methods and demons
1. Limited contribution in the context of existing IB work. In most work based on the Information Bottleneck principle, compression operations (such as nonlinear functions like tanh [1] and kernel-based functions [2], or explicit compression losses like VIB [3] and SPC [4]) are often applied alongside normalization operations within neural networks. The paper claims to introduce an IB perspective to normalization, but it essentially introduces an additional compression operation into networks alread
1. The paper provides a theoretical viewpoint by integrating the information bottleneck principle into normalization design, albeit inspired by a previous paper (NormalNorm, ICML 2025), and contributes to a deeper understanding of how normalization can regulate information flow in representation learning. 2. The information-theoretic reformulation of normalization is well motivated, and the overall structure and derivation are easy to follow. 3. Empirical validation on multiple model
1. The overall idea and contribution are incremental; in particular, the writing largely follows the presentation of the previous paper (NormalNorm, ICML 2025), e.g., the Information Bottleneck view of normalization and the description of the algorithm (NormalNorm, ICML 2025). Besides, this paper omits several papers that it should discuss and compare against, e.g., whitening methods in standard supervised learning [1,2] and self-supervised learning [3,4]. I think this paper should compare to IterNorm [2] meth
* The paper clearly identifies a potential limitation of existing normalization methods (focusing solely on moments) and proposes addressing it through the principled lens of the Information Bottleneck. * The method is evaluated across different domains (NLP, CV), architectures (Transformers, CNNs), and model scales, providing a broad assessment of its practical performance. * IBNorm seems to outperform standard baselines (LN, BN, RMSNorm) and a relevant recent competitor (NormalNorm) across mos
The paper's primary weaknesses lie in the lack of sound theoretical justification for its central claims and unfair empirical comparisons in the vision domain. 1. **Unfair Empirical Comparison in Vision Experiments:** The claimed empirical superiority in vision models (Table 3) is based on an unfair comparison. Appendix F.2 reveals that the authors used different hyperparameters (learning rates, weight decays) for their proposed IBNorm method compared to the baseline methods (including BatchNo
