Spherical Perspective on Learning with Normalization Layers
Simon Roburin, Yann de Mont-Marin, Andrei Bursuc, Renaud Marlet, Patrick Pérez, Mathieu Aubry

TL;DR
This paper presents a geometric spherical framework to analyze how normalization layers influence neural network training dynamics, revealing that SGD with NLs behaves like a constrained Adam optimizer on a hypersphere.
Contribution
It introduces a spherical geometric perspective for understanding NLs, derives an effective learning rate for Adam, and shows that SGD with NLs is equivalent to a variant of Adam constrained to the unit hypersphere.
Findings
Derived the effective learning rate expression for Adam.
Proved SGD with NLs is equivalent to a hypersphere-constrained Adam.
Validated phenomena related to Adam variants through experiments.
Abstract
Normalization Layers (NLs) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. This paper introduces a spherical framework to study the optimization of neural networks with NLs from a geometric perspective. Concretely, the radial invariance of groups of parameters, such as filters for convolutional neural networks, allows the optimization steps to be translated onto the unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. First, the first effective learning rate expression for Adam is derived. Then, the framework is used to demonstrate that, in the presence of NLs, performing Stochastic Gradient Descent (SGD) alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, this analysis…
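The radial invariance mentioned in the abstract can be checked numerically. The sketch below is an illustration only, not the paper's code: it assumes a hypothetical toy loss that depends on a weight vector w only through its direction w / ||w||, which mimics a filter followed by a normalization layer. Under that assumption, the loss is unchanged when w is rescaled, the gradient is orthogonal to w and shrinks as 1/||w||, and a plain SGD step on w moves the direction exactly like a step taken from the unit sphere with the learning rate rescaled by 1/||w||^2. The names normalized_loss, grad, eta, and lam are placeholders chosen for this example.

```python
import numpy as np

def normalized_loss(w, x, y):
    """Toy loss (an assumption for illustration) depending on w only via w / ||w||."""
    u = w / np.linalg.norm(w)
    return 0.5 * (x @ u - y) ** 2

def grad(w, x, y):
    """Analytic gradient of normalized_loss with respect to w."""
    r = np.linalg.norm(w)
    u = w / r
    g_u = (x @ u - y) * x              # gradient with respect to the direction u
    return (g_u - (g_u @ u) * u) / r   # tangent projection, scaled by 1 / ||w||

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
y = 0.3
lam = 3.0   # arbitrary positive rescaling factor
eta = 0.1   # arbitrary learning rate

g = grad(w, x, y)

# 1. Radial invariance: rescaling w leaves the loss unchanged.
loss_gap = abs(normalized_loss(lam * w, x, y) - normalized_loss(w, x, y))

# 2. The gradient has no radial component and scales as 1 / lam.
radial_component = g @ w
scale_gap = np.max(np.abs(grad(lam * w, x, y) - g / lam))

# 3. An SGD step on w and a step from the unit-norm direction u with
#    learning rate eta / ||w||^2 point in the same direction.
r = np.linalg.norm(w)
u = w / r
step_w = w - eta * g
step_u = u - (eta / r**2) * grad(u, x, y)
dir_gap = np.max(np.abs(step_w / np.linalg.norm(step_w)
                        - step_u / np.linalg.norm(step_u)))

print(loss_gap, radial_component, scale_gap, dir_gap)
```

All four printed quantities should be zero up to floating-point error. This covers only the plain SGD case on a toy loss; the Adam effective learning rate derived in the paper is not reproduced here.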
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Image and Signal Denoising Methods
Methods: Batch Normalization · Stochastic Gradient Descent · Adam
