Where You Place the Norm Matters: From Prejudiced to Neutral Initializations

Emanuele Francazi; Francesco Pinto; Aurelien Lucchi; Marco Baity-Jesi

arXiv:2505.11312·cs.LG·April 3, 2026

TL;DR

This paper theoretically analyzes how the choice and placement of normalization layers in neural networks influence initial prediction distributions, affecting training dynamics and model behavior.

Contribution

It provides a theoretical framework linking normalization design choices to initial prediction regimes, guiding more principled network architecture decisions.

Findings

01. Normalization choice and placement determine initial prediction bias or neutrality.

02. Architectural decisions induce systematic shifts in initial prediction regimes.

03. Results clarify how common normalization variants influence early training behavior.

Abstract

Normalization layers were introduced to stabilize and accelerate training, yet their influence is critical already at initialization, where they shape signal propagation and output statistics before parameters adapt to data. In practice, both which normalization to use and where to place it are often chosen heuristically, despite the fact that these decisions can qualitatively alter a model's behavior. We provide a theoretical characterization of how normalization choice and placement (Pre-Norm vs. Post-Norm) determine the distribution of class predictions at initialization, ranging from unbiased (Neutral) to highly concentrated (Prejudiced) regimes. We show that these architectural decisions induce systematic shifts in the initial prediction regime, thereby modulating subsequent learning dynamics. By linking normalization design directly to prediction statistics at initialization, our…
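
The quantity the abstract refers to, the distribution of class predictions at initialization, can be estimated empirically. The following is a minimal sketch under stated assumptions, not the authors' code: it builds an untrained ReLU MLP with Gaussian weights and reports the fraction of inputs assigned to each class. The function names, the He-style weight scaling, the parameter-free layer normalization, and the two placement modes (`"before_act"`, `"after_act"`) are all illustrative choices; they may not match the paper's definitions of Pre-Norm and Post-Norm.

```python
import numpy as np


def layer_norm(h, eps=1e-5):
    # Normalize each sample across its features (no learned gain or bias).
    mu = h.mean(axis=1, keepdims=True)
    var = h.var(axis=1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)


def class_fractions_at_init(x, depth=10, width=128, n_classes=2,
                            norm="none", rng=None):
    """Fraction of inputs in `x` that an untrained MLP assigns to each class.

    norm (hypothetical option names, chosen for this sketch):
      "none"       -> ReLU(z)
      "before_act" -> ReLU(layer_norm(z))
      "after_act"  -> layer_norm(ReLU(z))
    """
    rng = np.random.default_rng() if rng is None else rng
    h = x
    for _ in range(depth):
        fan_in = h.shape[1]
        # Assumed He-style Gaussian initialization.
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, width))
        z = h @ w
        if norm == "none":
            h = np.maximum(z, 0.0)
        elif norm == "before_act":
            h = np.maximum(layer_norm(z), 0.0)
        elif norm == "after_act":
            h = layer_norm(np.maximum(z, 0.0))
        else:
            raise ValueError(f"unknown norm mode: {norm}")
    w_out = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, n_classes))
    preds = np.argmax(h @ w_out, axis=1)
    return np.bincount(preds, minlength=n_classes) / len(preds)


if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    x = data_rng.normal(size=(500, 64))  # synthetic inputs, for illustration only
    for mode in ("none", "before_act", "after_act"):
        # One row per random initialization; columns are class fractions.
        runs = [class_fractions_at_init(x, norm=mode,
                                        rng=np.random.default_rng(seed))
                for seed in range(5)]
        print(mode, np.round(np.array(runs), 2))
```

In this setup, fractions near 1/`n_classes` for every initialization would correspond to what the abstract calls a Neutral regime, while fractions that pile up on a single class (which class varies with the seed) would correspond to a Prejudiced one. Which placement yields which regime is the paper's result and is not asserted by this sketch.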
