TL;DR
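FlashNorm is an exact reformulation of RMSNorm followed by a linear layer. It folds the normalization weights into that linear layer and defers the scalar RMS scaling to the output of the matrix multiplication, so the two operations can run in parallel, which reduces latency and simplifies implementation.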
Contribution
The paper presents FlashNorm, a reformulation of RMSNorm that removes the normalization weights and lets the RMS calculation run in parallel with the subsequent matrix multiplication, improving speed and simplicity in transformer models.
Findings
Achieves 33-35% lower latency on an NVIDIA T4 GPU at SmolLM2-135M scale.
Achieves 12-14% lower latency at Llama-7B scale.
Verifies zero-loss weight folding on three models.
Abstract
Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing parallel execution. We present FlashNorm, an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. Additionally, by the scale invariance of RMS, an RMSNorm followed by a linear layer followed by another RMSNorm allows the first RMSNorm to be eliminated entirely: a mathematically identical simplification that removes the pre-attention RMSNorm in models using QKV-normalization (e.g., Gemma 4) and in MLA-models…
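The two steps named in the abstract, and the scale-invariance shortcut, can be checked numerically on toy data. The following is a minimal sketch, not the authors' implementation: it uses pure Python lists, omits the usual epsilon term from the RMS, and all names (`rmsnorm`, `fold_weights`, `flashnorm`, the sizes `n` and `m`) are hypothetical choices made for illustration.

```python
import math
import random

def rms(x):
    # Root mean square of a vector (epsilon omitted; an assumption of this sketch).
    return math.sqrt(sum(v * v for v in x) / len(x))

def rmsnorm(x, g):
    # Standard RMSNorm: divide by the RMS, then scale elementwise by weights g.
    r = rms(x)
    return [v / r * gi for v, gi in zip(x, g)]

def matvec(x, W):
    # Linear layer without bias: x has length n, W is n x m, result has length m.
    n, m = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(n)) for j in range(m)]

def fold_weights(g, W):
    # Step (i): fold the normalization weights into the linear layer,
    # i.e. scale row i of W by g[i]. Done once, offline.
    return [[g[i] * w for w in W[i]] for i in range(len(W))]

def flashnorm(x, W_folded):
    # Step (ii): the matmul no longer depends on the RMS, so the two
    # computations below are independent and could run in parallel.
    y = matvec(x, W_folded)
    r = rms(x)
    return [v / r for v in y]  # deferred scalar normalization

random.seed(0)
n, m = 8, 4
x = [random.gauss(0, 1) for _ in range(n)]
g = [random.uniform(0.5, 1.5) for _ in range(n)]
W = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
W_folded = fold_weights(g, W)

# Baseline (RMSNorm, then linear) versus the reformulation.
reference = matvec(rmsnorm(x, g), W)
reformulated = flashnorm(x, W_folded)
max_err = max(abs(a - b) for a, b in zip(reference, reformulated))

# Scale invariance: if a second RMSNorm follows the linear layer, the first
# normalization only rescales its input by a positive scalar, so it can be dropped.
g2 = [1.0] * m
with_first_norm = rmsnorm(reference, g2)
without_first_norm = rmsnorm(matvec(x, W_folded), g2)
max_err_invariance = max(abs(a - b) for a, b in zip(with_first_norm, without_first_norm))

print(max_err, max_err_invariance)
```

Both printed errors should be at the level of floating-point rounding. With a nonzero epsilon inside the RMS, the second equivalence would hold only approximately, which is why this sketch leaves epsilon out.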
Code & Models
- 🤗 open-machine/FlashNorm (model) · ♡ 2
- 🤗 drewwas/OpenMachine_FlashNorm (model) · 1 dl
- 🤗 open-machine/Qwen3-1.7B-FlashNorm (model) · 700 dl · ♡ 1
- 🤗 open-machine/Llama-3.2-1B-FlashNorm (model) · 640 dl · ♡ 1
- 🤗 open-machine/Qwen3-8B-FlashNorm (model) · 756 dl · ♡ 2
- 🤗 open-machine/Llama-3.1-8B-FlashNorm (model) · 782 dl · ♡ 2
