TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, Jianxin Wu

TL;DR
TWEO adds a simple loss term that eliminates extreme activation outliers during Transformer training. This enables FP8 training and quantization without complex modifications, and significantly improves throughput and quantization performance.
Contribution
The paper shows that extreme outliers are data-independent artifacts of training and proposes TWEO, a non-invasive loss function that enables FP8 training and quantization for Transformers.
Findings
Reduces extreme activation outliers from magnitudes above 10,000 to below 20.
Enables full-model FP8 pre-training without architectural changes.
Achieves state-of-the-art quantization performance with W8A8 (8-bit weights and 8-bit activations).
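To see why the outlier range matters for 8-bit activations, the sketch below compares quantization error when the largest value in a tensor is 20 versus 10,000. It is an illustration only, not the paper's quantization pipeline: it assumes symmetric per-tensor int8 quantization with absmax scaling, synthetic Gaussian activations, and hypothetical helper names (`quantize_int8_absmax`, `error_on_typical`).

```python
import random

def quantize_int8_absmax(values):
    """Symmetric per-tensor int8 quantization (assumed scheme, not from the paper)."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Round to the nearest integer level, clamp to [-127, 127], map back to floats.
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def mean_abs_error(original, dequantized):
    """Mean absolute difference between original and dequantized values."""
    return sum(abs(a - b) for a, b in zip(original, dequantized)) / len(original)

# Synthetic "typical" activations: unit-variance Gaussian noise (an assumption).
rng = random.Random(0)
typical = [rng.gauss(0.0, 1.0) for _ in range(1024)]

def error_on_typical(peak):
    """Quantize the typical values together with one value of magnitude `peak`,
    then measure the error on the typical values only."""
    dequantized = quantize_int8_absmax(typical + [peak])
    return mean_abs_error(typical, dequantized[:len(typical)])

err_bounded = error_on_typical(20.0)      # largest magnitude is 20
err_extreme = error_on_typical(10000.0)   # largest magnitude is 10,000

print(f"mean abs error, peak 20:    {err_bounded:.4f}")
print(f"mean abs error, peak 10000: {err_extreme:.4f}")
```

With a peak of 10,000 the quantization step is so coarse that every typical value rounds to zero, while a peak of 20 leaves enough levels to represent them. Per-channel or per-token scaling changes the numbers but not the underlying effect.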
Abstract
Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM…
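The abstract excerpt does not give the form of the TWEO loss term, so the following is not the TWEO loss. It is a generic, hypothetical sketch of how an auxiliary activation-magnitude penalty can be added to a task loss without touching the model architecture. The hinge shape, the `threshold` of 20, the `power`, the `weight`, and both function names are assumptions made for illustration.

```python
def outlier_penalty(activations, threshold=20.0, power=2):
    """Hypothetical hinge-style penalty: zero while |a| <= threshold,
    growing polynomially once a magnitude exceeds it."""
    excess = [max(0.0, abs(a) - threshold) for a in activations]
    return sum(e ** power for e in excess) / len(activations)

def total_loss(task_loss, per_layer_activations, weight=1e-4):
    """Task loss plus the penalty averaged over layers (assumed combination)."""
    penalty = sum(outlier_penalty(a) for a in per_layer_activations)
    penalty /= len(per_layer_activations)
    return task_loss + weight * penalty

# Activations within the threshold contribute nothing to the loss.
print(total_loss(2.0, [[1.0, -5.0, 19.9]]))
# A single large activation raises the loss and would receive a gradient.
print(total_loss(2.0, [[10000.0, 0.5, -0.3]]))
```

The point of the design is that the penalty lives entirely in the loss, so the model definition and the inference path are unchanged.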
Taxonomy
Topics: Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices · Physical Unclonable Functions (PUFs) and Hardware Security
