FlashNorm: Fast Normalization for Transformers

Nils Graef; Filip Makraduli; Andrew Wasielewski; Matthew Clapp

arXiv:2407.09577·cs.LG·April 28, 2026

TL;DR

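FlashNorm is an exact reformulation of RMSNorm followed by a linear layer. It folds the normalization weights into that linear layer and defers the scalar RMS normalization to the output of the matrix multiplication, so the two operations can run in parallel, which reduces latency and simplifies implementation.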

Contribution

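The paper presents FlashNorm, an exact reformulation of RMSNorm followed by a linear layer. It removes the normalization weights by folding them into the linear layer and lets the RMS calculation run in parallel with the matrix multiplication, improving speed and simplicity in transformer models.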
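The reformulation can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it uses only the Python standard library, sets epsilon to zero by default, assumes a bias-free linear layer, and all names (`rms`, `matvec`, `fold_weights`, `flashnorm_linear`) are hypothetical helpers chosen for this example.

```python
import math
import random

def rms(x, eps=0.0):
    # Root mean square of a vector; eps=0 is an assumption of this sketch.
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)

def matvec(x, W):
    # Computes x @ W for x of length n and W given as n rows of m columns.
    n, m = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(n)) for j in range(m)]

def rmsnorm_then_linear(x, g, W):
    # Reference path: normalize, scale by weights g, then multiply by W.
    r = rms(x)
    normed = [x[i] / r * g[i] for i in range(len(x))]
    return matvec(normed, W)

def fold_weights(g, W):
    # Fold normalization weights into the linear layer: W*[i][j] = g[i] * W[i][j].
    return [[g[i] * w for w in W[i]] for i in range(len(W))]

def flashnorm_linear(x, W_star):
    # The matrix multiplication does not depend on the RMS value, so the
    # two computations could run in parallel; the division comes last.
    y = matvec(x, W_star)
    r = rms(x)
    return [v / r for v in y]

random.seed(0)
n, m = 8, 4
x = [random.gauss(0, 1) for _ in range(n)]
g = [random.gauss(1, 0.1) for _ in range(n)]
W = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

ref = rmsnorm_then_linear(x, g, W)
out = flashnorm_linear(x, fold_weights(g, W))
max_abs_diff = max(abs(a - b) for a, b in zip(ref, out))
print(max_abs_diff)
```

The two paths differ only by floating-point rounding. Folding happens once, offline, so at inference time only `flashnorm_linear` runs and no normalization weights need to be stored.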

Findings

01 Achieves 33-35% lower latency on an NVIDIA T4 GPU at SmolLM2-135M scale.

02 Achieves 12-14% lower latency at Llama-7B scale.

03 Verifies zero-loss weight folding on three models.

Abstract

Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing parallel execution. We present FlashNorm, an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. Additionally, by the scale invariance of RMS, an RMSNorm followed by a linear layer followed by another RMSNorm allows the first RMSNorm to be eliminated entirely, a mathematically identical simplification that removes the pre-attention RMSNorm in models using QKV-normalization (e.g., Gemma 4) and in MLA-models…
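The scale-invariance argument can be illustrated with a small numerical check. This is a sketch under stated assumptions rather than the paper's code: epsilon is zero, the linear layer has no bias, and the helper names (`rmsnorm`, `matvec`, `fold_weights`) are hypothetical. Because the first RMSNorm only divides its input by a positive scalar, and the second RMSNorm is unchanged by positive scaling of its input, the first normalization reduces to folding its weights into the linear layer.

```python
import math
import random

def rms(x):
    # Root mean square with epsilon omitted (assumption of this sketch).
    return math.sqrt(sum(v * v for v in x) / len(x))

def rmsnorm(x, g):
    # Standard RMSNorm: divide by the RMS, then scale elementwise by g.
    r = rms(x)
    return [x[i] / r * g[i] for i in range(len(x))]

def matvec(x, W):
    # Computes x @ W for x of length n and W given as n rows of m columns.
    n, m = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(n)) for j in range(m)]

def fold_weights(g, W):
    # Fold the first norm's weights into the linear layer.
    return [[g[i] * w for w in W[i]] for i in range(len(W))]

random.seed(1)
n, m = 8, 4
x = [random.gauss(0, 1) for _ in range(n)]
g1 = [random.gauss(1, 0.1) for _ in range(n)]
g2 = [random.gauss(1, 0.1) for _ in range(m)]
W = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

# Reference: RMSNorm -> linear -> RMSNorm.
z_ref = rmsnorm(matvec(rmsnorm(x, g1), W), g2)

# Simplified: the first RMSNorm is dropped; only its weights survive, folded into W.
z_simplified = rmsnorm(matvec(x, fold_weights(g1, W)), g2)

max_abs_diff = max(abs(a - b) for a, b in zip(z_ref, z_simplified))
print(max_abs_diff)
```

The skipped division by the RMS of `x` cancels inside the second normalization, so the two outputs agree up to rounding. With a nonzero epsilon or a bias term the cancellation is no longer exact in this simple form.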

Code & Models

Repositories

OpenMachine-ai/transformer-tricks (GitHub)
