Learning in Compact Spaces with Approximately Normalized Transformer
Jörg K.H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

TL;DR
This paper introduces an approximate normalization technique for transformers that bounds parameter norms rather than strictly normalizing them onto a hypersphere, leading to faster convergence and better scaling without weight decay or learning rate warm-up.
Contribution
It proposes a holistic approximate normalization method, built on simple scalar multiplications and constrained parameter norms, that removes the need for weight decay and learning rate warm-up in transformer training.
Findings
Up to 40% faster convergence compared to GPT with QK normalization.
Enables training with larger batch sizes while maintaining scaling laws.
Minimal 3% additional runtime cost for the normalization method.
Abstract
The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but…
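The concentration argument in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes standard Gaussian vectors, and the helper names (`norm_concentration`, `approx_normalize`, `clip_row_norms`) and the specific factor `1/sqrt(dim)` are illustrative choices that do not come from the paper. It shows that the norm of a high-dimensional random vector divided by `sqrt(dim)` clusters tightly around 1, so a fixed scalar multiplication can stand in for exact normalization, and it contrasts bounding a norm with strictly normalizing it.

```python
import numpy as np


def norm_concentration(dim, n_samples=1000, seed=0):
    """Mean and std of ||x|| / sqrt(dim) for x ~ N(0, I_dim) (assumed distribution)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))
    ratios = np.linalg.norm(x, axis=1) / np.sqrt(dim)
    return ratios.mean(), ratios.std()


def approx_normalize(x):
    """Multiply by the fixed scalar 1/sqrt(dim) instead of dividing by the exact norm."""
    return x / np.sqrt(x.shape[-1])


def clip_row_norms(w, max_norm=1.0):
    """Rescale only rows whose norm exceeds max_norm (a bound, not strict normalization)."""
    norms = np.linalg.norm(w, axis=-1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return w * scale


# The spread of the ratio shrinks as the dimension grows.
for dim in (16, 256, 4096):
    mean, std = norm_concentration(dim)
    print(f"dim={dim:5d}  mean ratio={mean:.4f}  std={std:.4f}")
```

The spread of the ratio shrinks roughly like `1/sqrt(2 * dim)`, which is why a scalar factor becomes a good substitute for an exact norm computation at typical model widths. `clip_row_norms` leaves rows inside the unit ball untouched, which is one simple way to constrain norms without forcing every row onto the sphere.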
Taxonomy
Topics: Model Reduction and Neural Networks · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
