Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers
Andrei Kanavalau, Carmen Amo Alonso, Sanjay Lall

TL;DR
This paper introduces a gated normalization-removal method for pre-norm transformers. It shows that final normalization anchors the scale of pre-logit representations, and that tapering internal normalization improves decoding throughput.
Contribution
It proposes TaperNorm, which starts from standard RMSNorm/LayerNorm and gradually tapers each normalization layer to a learned sample-independent linear or affine map, enabling norm-free inference in the tapered layers and faster decoding.
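The gating idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the convex blend form, the function name `taper_norm`, and the parameters `scale` and `gate` are assumptions made here for illustration, since the summary does not give the exact formulation.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Standard RMSNorm: divides each token by its own root-mean-square,
    # so the output depends on per-token (sample-dependent) statistics.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def taper_norm(x, gamma, scale, gate, eps=1e-6):
    # Hypothetical gated blend (an assumption, not the paper's exact form):
    #   gate = 1 -> plain RMSNorm
    #   gate = 0 -> fixed per-channel scaling, no per-token statistics
    if gate == 0.0:
        return x * scale  # sample-independent linear map
    return gate * rms_norm(x, gamma, eps) + (1.0 - gate) * (x * scale)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 tokens, 8 channels
gamma = np.ones(8)               # RMSNorm gain
scale = np.full(8, 0.5)          # learned sample-independent scale (illustrative value)

full = taper_norm(x, gamma, scale, gate=1.0)   # behaves like RMSNorm
half = taper_norm(x, gamma, scale, gate=0.5)   # mid-taper mixture
free = taper_norm(x, gamma, scale, gate=0.0)   # norm-free
```

In this sketch the gate would be annealed from 1 to 0 during training; the schedule itself is not described in the summary and is left out.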
Findings
Internal normalization can be tapered with minimal validation-loss increase.
Final normalization anchors the scale of pre-logit representations.
Tapering internal norms improves decoding throughput by up to 1.18x.
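One way to read the scale-anchoring finding is through a general property of RMSNorm: its output is (up to the small epsilon term) unchanged when the input is rescaled, so logits computed after a final norm do not grow with the magnitude of the hidden state. The sketch below illustrates only that property. The names `h` and `W_out` and the factor 3.0 are illustrative assumptions, and this is not the paper's analysis.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    # RMSNorm without a learned gain, for simplicity.
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return h / rms

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 16))        # pre-logit hidden states (hypothetical)
W_out = rng.normal(size=(16, 10))   # output projection (hypothetical)

# With a final norm, rescaling the hidden state leaves the logits essentially unchanged.
logits_normed = rms_norm(h) @ W_out
logits_normed_scaled = rms_norm(3.0 * h) @ W_out

# Without a final norm, the logits scale directly with the hidden state.
logits_raw = h @ W_out
logits_raw_scaled = (3.0 * h) @ W_out
```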
Abstract
Normalization layers are standard in transformers, but it is not clear whether their sample-dependent computations are necessary throughout both training and inference. This work develops a gated normalization-removal approach for pre-norm transformers. The approach is implemented using TaperNorm, which starts from standard RMSNorm/LayerNorm and gradually tapers to learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are no longer computed in the tapered layers and the resulting maps can be folded into adjacent linear projections. The results indicate that internal normalization can be tapered in the tested pre-training and fine-tuning settings with small validation-loss increases. Our approach helps reveal a distinct role for final normalization, namely that it anchors the scale of the pre-logit representation. With this anchor present,…
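The folding step mentioned in the abstract can be sketched as follows, assuming the tapered layer has become a per-channel affine map `x * scale + bias` that feeds a linear projection `W`, `b`. All names and shapes here are hypothetical; the sketch only shows the standard algebra of merging an affine map into the following linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 5

# Hypothetical learned sample-independent affine map left after tapering.
scale = rng.uniform(0.5, 1.5, size=d_in)
bias = rng.normal(size=d_in)

# Hypothetical adjacent linear projection.
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

x = rng.normal(size=(4, d_in))

# Two-step computation: affine map, then projection.
two_step = (x * scale + bias) @ W + b

# Folded computation: absorb the affine map into the projection weights.
W_folded = scale[:, None] * W   # equivalent to diag(scale) @ W
b_folded = bias @ W + b
one_step = x @ W_folded + b_folded
```

After folding, the tapered layer costs nothing extra at inference time, because only the single projection is evaluated.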
Taxonomy
Topics: Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing · Machine Learning in Materials Science
