When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
Lucky Verma

TL;DR
Dynamic Tanh (DyT) replaces LayerNorm with activation bounding. The bounding acts as a regime-dependent implicit regularizer: it helps in the low-data regime (1M tokens) but hurts with more data (118M tokens), and its effect can be measured through activation saturation and targeted interventions.
Contribution
This work characterizes the regime-dependent effects of activation bounding as an implicit regularizer, providing insights into when removing LayerNorm helps or harms model performance.
Findings
At 64M parameters, DyT improves validation loss by 27.3% with 1M training tokens but worsens it by 18.8% with 118M tokens.
Activation saturation tracks DyT's performance impact: 49% of DyT activations saturate at 1M tokens versus 23% at 118M tokens.
Interventions like HardTanh and alpha adjustments support the bounding mechanism explanation.
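The saturation figures above can be made concrete with a small sketch. It assumes a saturated activation is one where |tanh(alpha x)| exceeds a cutoff; the cutoff of 0.99 and the function names are illustrative assumptions, not the paper's stated definition. The HardTanh helper shows the standard clamp to [-1, 1], which bounds activations without a smooth saturating region.

```python
import math

def saturation_fraction(pre_acts, alpha, threshold=0.99):
    """Fraction of activations whose bounded value |tanh(alpha * x)| exceeds threshold.

    threshold=0.99 is an assumed cutoff for illustration only.
    """
    if not pre_acts:
        raise ValueError("pre_acts must be non-empty")
    saturated = sum(1 for v in pre_acts if abs(math.tanh(alpha * v)) > threshold)
    return saturated / len(pre_acts)

def hard_tanh(v):
    """Standard HardTanh: clamp the input to the interval [-1, 1]."""
    return max(-1.0, min(1.0, v))

# Two of the four toy pre-activations are large enough to saturate tanh.
frac = saturation_fraction([0.0, 10.0, -10.0, 0.1], alpha=1.0)
clamped = [hard_tanh(v) for v in [-3.0, 0.25, 7.0]]
```

Raising alpha pushes more pre-activations into the flat part of tanh, so the measured fraction depends on both the learned alpha and the scale of the inputs.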
Abstract
Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(alpha x). We show that this bounding is a regime-dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT-2-family models spanning 64M to 3.78B parameters and 1M to 118M tokens, with Llama and ViT cross-checks, DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M; the 1M benefit vanishes with capacity (+1.7% at 3.78B), while the 118M penalty reaches +27.9%. The mechanism is measurable: 49% of DyT activations saturate at 1M versus 23% at 118M, and a 500-step saturation heuristic classifies DyT's sign with 75% raw in-sample accuracy on the 12-cell GPT-2 calibration set (AUC 0.75; 64% when adding Scale 5 stress cells), correctly labels 3/3 Llama checks, but only reaches 50% raw leave-one-scale-out accuracy. Three interventions support the bounding…
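As a minimal sketch of the operation the abstract describes, the snippet below contrasts elementwise DyT with a plain LayerNorm on a single token vector. The abstract only specifies tanh(alpha x); the affine parameters gamma and beta and the starting value alpha=0.5 are assumptions taken from the commonly cited DyT formulation, and pure-Python lists stand in for tensors.

```python
import math

def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Elementwise Dynamic Tanh: gamma * tanh(alpha * v) + beta.

    alpha would be a learned scalar in training; gamma and beta are assumed
    affine parameters (not stated in the abstract).
    """
    return [gamma * math.tanh(alpha * v) + beta for v in x]

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over one vector (no affine), using per-vector statistics."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

token = [0.2, -1.5, 40.0, 3.0]
bounded = dyt(token)            # every entry lies in [-1, 1] when gamma=1, beta=0
normalized = layer_norm(token)  # zero mean, roughly unit variance, not bounded
```

The design difference is that DyT needs no mean or variance computed from the input: each element is squashed independently, so outliers are clipped toward plus or minus one rather than rescaled along with the rest of the vector.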
