Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang

TL;DR
This paper introduces Outlier-Safe Pre-Training (OSP), a novel training approach that prevents activation outliers in large language models, enabling more effective 4-bit quantization and efficient deployment.
Contribution
OSP combines three innovations to proactively prevent outliers during training, allowing large language models to be quantized efficiently without post-hoc mitigation.
Findings
First production-scale LLM trained without extreme activation outliers: a 1.4B-parameter model trained on 1 trillion tokens.
Achieves 35.7 average score on 10 benchmarks under 4-bit quantization.
Near-zero excess kurtosis (0.04) indicating drastically reduced outliers.
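To make the kurtosis and 4-bit findings concrete, here is a minimal sketch of how excess kurtosis flags outliers and why a few extreme values hurt low-bit quantization. Everything in it is an illustrative assumption rather than the paper's evaluation setup: the synthetic activations, the helper names `excess_kurtosis` and `quantize_int4_absmax`, and the symmetric per-tensor absmax quantizer.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3, so a Gaussian scores about 0."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2 - 3.0

def quantize_int4_absmax(x):
    """Symmetric per-tensor 4-bit fake quantization (hypothetical scheme).

    The scale is set by the largest magnitude, so a single outlier
    stretches the grid and coarsens every other value.
    """
    scale = np.max(np.abs(x)) / 7.0
    q = np.clip(np.round(x / scale), -8, 7)  # signed 4-bit integer range
    return q * scale                          # dequantize back to floats

rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)  # synthetic, well-behaved activations
spiky = clean.copy()
spiky[:4] = 100.0                  # inject a few synthetic outliers

clean_kurt = excess_kurtosis(clean)
spiky_kurt = excess_kurtosis(spiky)
clean_err = np.mean((clean - quantize_int4_absmax(clean)) ** 2)
spiky_err = np.mean((spiky - quantize_int4_absmax(spiky)) ** 2)

print(f"clean: kurtosis={clean_kurt:.2f}, int4 MSE={clean_err:.4f}")
print(f"spiky: kurtosis={spiky_kurt:.2f}, int4 MSE={spiky_err:.4f}")
```

In this toy setting the outlier-free tensor has excess kurtosis close to zero and a small quantization error, while four injected outliers push the kurtosis into the hundreds and force most values to round to zero.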
Abstract
Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive…
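The abstract names Single-Scale RMSNorm without giving its formula. The sketch below assumes it means replacing the per-channel gain vector of a conventional RMSNorm with one scalar shared by all channels; that parameterization, the function names, and the toy gain values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """Conventional RMSNorm: `gain` has one entry per channel."""
    rms = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
    return x / rms * gain

def single_scale_rmsnorm(x, scale, eps=1e-6):
    """Assumed single-scale variant: `scale` is one scalar for all channels."""
    rms = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
    return x / rms * scale

def peak_to_typical(y):
    """Ratio of the largest per-channel mean magnitude to the median one."""
    per_channel = np.mean(np.abs(y), axis=0)
    return per_channel.max() / np.median(per_channel)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 64))  # (tokens, channels), synthetic

# A per-channel gain can single out one channel and amplify it.
channel_gain = np.ones(64)
channel_gain[3] = 50.0                 # hypothetical learned gain
per_channel_out = rmsnorm(hidden, channel_gain)

# A scalar scale rescales every channel by the same factor.
single_scale_out = single_scale_rmsnorm(hidden, 1.5)

print("per-channel gain:", peak_to_typical(per_channel_out))
print("single scale:    ", peak_to_typical(single_scale_out))
```

Under this assumption the normalization layer has no way to amplify an individual channel, which is one reading of "preventing channel-wise amplification" in the abstract.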
Code & Models
- 🤗 dmis-lab/OSP-1.4B-100B-Adam · model · 11 dl · ♡ 3
- 🤗 dmis-lab/OSP-1.4B-100B-Muon · model · 5 dl · ♡ 3
- 🤗 dmis-lab/OSP-1.4B-100B-Muon-EmbProj · model · 2 dl · ♡ 3
- 🤗 dmis-lab/OSP-1.4B-100B-Muon-Only · model · 4 dl · ♡ 3
- 🤗 dmis-lab/OSP-1.4B-100B-Muon-SSNorm · model · 2 dl · ♡ 3
- 🤗 dmis-lab/OSP-1.4B-100B-Muon-SSNorm-EmbProj · model · 2 dl · ♡ 4
- 🤗 dmis-lab/OSP-1.4B-100B-Shampoo-SSNorm · model · 5 dl · ♡ 3
- 🤗 dmis-lab/OSP-1.4B-100B-Shampoo-SSNorm-EmbProj · model · 3 dl · ♡ 4
- 🤗 dmis-lab/OSP-1.4B-1T-Adam · model · 6 dl · ♡ 3
- 🤗 dmis-lab/OSP-1.4B-1T-Muon-SSNorm-EmbProj · model · 3 dl · ♡ 4
Taxonomy
Topics: Speech Recognition and Synthesis · COVID-19 diagnosis using AI · Radiomics and Machine Learning in Medical Imaging
