Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training
Abdelrahman Abouzeid (Georgia Institute of Technology)

TL;DR
This paper investigates how the interaction between normalization layers and optimizers affects large language model training, revealing significant coupling effects and proposing mitigation strategies.
Contribution
It shows that the choice of normalization function and the choice of optimizer are coupled in LLM training, and demonstrates two ways to mitigate the effect.
Findings
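Dynamic Erf (Derf) suffers a large negative interaction with the Muon optimizer: its gap to RMSNorm grows from +0.31 nats under AdamW to +0.97 under Muon.
An EMA-blend that reintroduces running scale estimates recovers ~84% of the gap.
Reducing Derf's alpha from its published default of 0.5 to 0.3 recovers ~80% of the gap.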
Abstract
In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3x2 factorial at 1B parameters and 1000 training steps, we show this assumption can fail: Dynamic Erf (Derf; Chen & Liu, 2025) suffers a large negative interaction with Muon (Jordan, 2024), with its gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 under Muon, roughly three times as large. Dynamic Tanh (DyT; Zhu et al., 2025), included as a bounded-normalizer control, shows no such penalty. Our evidence points to two failure modes of erf under Muon's faster spectral-norm growth: saturation (lossy compression) and scale blindness (discarding activation magnitude). An EMA-blend that reintroduces running scale estimates recovers ~84% of the gap. Separately, reducing Derf's alpha from its published default of 0.5 to 0.3 recovers ~80% by keeping erf in its near-linear…
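The saturation argument can be illustrated with a small sketch. It is not the paper's implementation: it assumes the core of Derf is an elementwise erf(alpha * x) and omits any learnable affine or shift terms, and the helper names `derf_like` and `linearity_ratio` are hypothetical. It compares erf(alpha * x) with its small-input linearization (2/sqrt(pi)) * alpha * x, so a ratio near 1 means the function is still close to linear and a ratio well below 1 means it is compressing the input.

```python
import math

def derf_like(x: float, alpha: float) -> float:
    """Elementwise erf(alpha * x). Affine/shift terms are omitted (assumption)."""
    return math.erf(alpha * x)

def linearity_ratio(x: float, alpha: float) -> float:
    """Ratio of erf(alpha * x) to its first-order approximation around zero.

    erf(z) is approximately (2 / sqrt(pi)) * z for small z, so values near 1.0
    indicate the near-linear regime and smaller values indicate saturation.
    """
    linear = 2.0 / math.sqrt(math.pi) * alpha * x
    return derf_like(x, alpha) / linear

# Illustrative activation magnitudes (not taken from the paper).
for magnitude in (1.0, 2.0, 4.0):
    for alpha in (0.5, 0.3):
        ratio = linearity_ratio(magnitude, alpha)
        print(f"|x|={magnitude:.1f} alpha={alpha:.1f} linearity={ratio:.3f}")
```

At every magnitude the ratio is higher for alpha = 0.3 than for alpha = 0.5, which is consistent with the abstract's point that a smaller alpha keeps erf closer to linear as activations grow.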
