The Hidden Power of Normalization Layers in Neural Networks: Exponential Capacity Control
Khoat Than

TL;DR
This paper provides a theoretical explanation for how normalization layers in neural networks control capacity exponentially, leading to improved training stability and generalization, which explains their empirical success.
Contribution
It introduces a capacity control framework showing normalization layers exponentially reduce Lipschitz constants, explaining their role in optimization and generalization.
Findings
Normalization layers exponentially reduce Lipschitz constants.
They smooth the loss landscape exponentially.
They constrain network capacity, improving generalization.
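One intuition behind the capacity claim can be sketched numerically. The snippet below is an illustrative assumption, not the paper's construction or proof: it uses a parameter-free layer normalization (hypothetical helper `layer_norm`) after each ReLU layer and checks that the output norm never exceeds the square root of the width, whatever the depth or weight scale. It shows only that the output scale is bounded, which is weaker than the Lipschitz results summarized above.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free layer normalization (no learned gain or bias); an
    # illustrative choice, not necessarily the variant analyzed in the paper.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def forward(x, weights, normalize):
    # Plain ReLU network, optionally normalizing after every layer.
    for W in weights:
        x = np.maximum(W @ x, 0.0)
        if normalize:
            x = layer_norm(x)
    return x

rng = np.random.default_rng(0)
width, depth = 16, 12
weights = [rng.standard_normal((width, width)) for _ in range(depth)]
x = rng.standard_normal(width)

plain = forward(x, weights, normalize=False)
normed = forward(x, weights, normalize=True)

# With normalization, ||output||^2 = width * var / (var + eps) <= width.
print("unnormalized output norm:", np.linalg.norm(plain))
print("normalized output norm:  ", np.linalg.norm(normed))
print("sqrt(width):             ", np.sqrt(width))
```

Because every normalized output lies in a ball of radius sqrt(width), the difference between two outputs is at most 2·sqrt(width), so the network cannot amplify large input changes without limit.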
Abstract
Normalization layers are critical components of modern AI systems such as ChatGPT, Gemini, and DeepSeek. Empirically, they are known to stabilize training dynamics and improve generalization ability. However, the underlying theoretical mechanism by which normalization layers contribute to both optimization and generalization remains largely unexplained, especially when many normalization layers are used in a deep neural network (DNN). In this work, we develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit exponentially large Lipschitz constants with respect to either its parameters or inputs, implying excessive functional capacity and potential overfitting. There are uncountably many such bad DNNs. In contrast, the insertion of normalization layers can provably reduce the Lipschitz constant…
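To see how an unnormalized network can have a Lipschitz constant that is exponential in depth, here is a minimal sketch using the standard product-of-spectral-norms upper bound for feed-forward networks with 1-Lipschitz activations. This is a textbook bound, not the paper's own construction; the helper names (`lipschitz_upper_bound`, `relu_net`) and the choice of weights `2 * I` are assumptions made for illustration.

```python
import numpy as np

def lipschitz_upper_bound(weights):
    # Product of spectral norms: upper bound on the input-Lipschitz constant
    # of a feed-forward net whose activations are 1-Lipschitz (e.g. ReLU).
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))

def relu_net(x, weights):
    # Unnormalized ReLU network without biases.
    for W in weights:
        x = np.maximum(W @ x, 0.0)
    return x

width = 4
for depth in (1, 5, 10, 20):
    # Hypothetical "bad" network: every layer doubles its input.
    weights = [2.0 * np.eye(width) for _ in range(depth)]
    bound = lipschitz_upper_bound(weights)

    # On positive inputs ReLU is the identity, so the bound is attained.
    x = np.ones(width)
    y = x + 0.1
    ratio = np.linalg.norm(relu_net(y, weights) - relu_net(x, weights)) / np.linalg.norm(y - x)
    print(f"depth={depth:2d}  bound={bound:.3e}  observed ratio={ratio:.3e}")
```

In this toy case the observed amplification matches the bound of 2 raised to the depth, so the exponential growth is not just an artifact of a loose estimate.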
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Neural Networks and Reservoir Computing
