Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

Jungwoo Park; Taewhoo Lee; Chanwoong Yoon; Hyeon Hwang; Jaewoo Kang

arXiv:2506.19697 · cs.LG · June 25, 2025

TL;DR

This paper introduces Outlier-Safe Pre-Training (OSP), a novel training approach that prevents activation outliers in large language models, enabling more effective 4-bit quantization and efficient deployment.

Contribution

OSP combines three components (the Muon optimizer, Single-Scale RMSNorm, and a learnable embedding projection) to prevent outliers from forming during training, so that large language models can be quantized without post-hoc mitigation.
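To make the "channel-wise amplification" point from the abstract concrete, here is a minimal sketch contrasting a conventional RMSNorm with per-channel gains against a variant that uses one shared scalar. This is an assumption based on the name "Single-Scale RMSNorm"; the page does not give the paper's exact formulation, and the function names, `eps` value, and sample numbers are hypothetical.

```python
import math

def rms(x, eps=1e-6):
    """Root mean square of a vector, with a small epsilon for stability."""
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)

def per_channel_rms_norm(x, gains, eps=1e-6):
    """Conventional RMSNorm: every channel has its own learnable gain."""
    r = rms(x, eps)
    return [g * v / r for g, v in zip(gains, x)]

def single_scale_rms_norm(x, scale, eps=1e-6):
    """Assumed single-scale variant: one scalar gain shared by all channels."""
    r = rms(x, eps)
    return [scale * v / r for v in x]

# Illustrative values only (not from the paper).
x = [0.5, -1.0, 2.0, -0.5]
gains = [1.0, 1.0, 30.0, 1.0]   # one channel has learned a large gain

per_channel_out = per_channel_rms_norm(x, gains)
single_scale_out = single_scale_rms_norm(x, 1.5)
```

With per-channel gains, a single large gain can blow up one channel relative to the rest. With a shared scalar, the relative magnitudes of the channels are unchanged by the normalization layer.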

Findings

01. First production-scale outlier-free LLM, trained on 1 trillion tokens.

02. Achieves a 35.7 average score across 10 benchmarks under 4-bit quantization.

03. Near-zero excess kurtosis (0.04), indicating drastically reduced outliers.
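For readers unfamiliar with the metric, excess kurtosis is the fourth standardized moment minus 3, so a Gaussian scores roughly zero and heavy-tailed data scores high. The sketch below uses the standard population formula on synthetic data; how the paper aggregates activations before computing it is not stated on this page, so treat the setup as an assumption.

```python
import random

def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0

# Synthetic data for illustration only (not the paper's activations).
rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Same samples, but a handful of entries replaced by extreme values.
with_outliers = list(gaussian_like)
for i in range(0, len(with_outliers), 2000):
    with_outliers[i] = 100.0

k_clean = excess_kurtosis(gaussian_like)
k_outliers = excess_kurtosis(with_outliers)
```

Here `k_clean` stays close to zero, while a few extreme entries push `k_outliers` up by orders of magnitude, which is why the metric works as a compact outlier indicator.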

Abstract

Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive…
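The claim that outliers degrade quantization can be illustrated with a toy example. The sketch below applies symmetric per-tensor absmax quantization to a signed 4-bit grid, a generic scheme chosen for illustration and not necessarily the one the paper evaluates; all names and values are hypothetical.

```python
def fake_quantize_int4(values):
    """Quantize to integer levels in [-7, 7] with one absmax scale, then dequantize."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / 7.0  # the largest magnitude maps to level 7
    return [max(-7, min(7, round(v / scale))) * scale for v in values]

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Illustrative activations (not from the paper).
clean = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]
with_outlier = clean[:-1] + [50.0]  # same values, last entry replaced by an outlier

deq_clean = fake_quantize_int4(clean)
deq_outlier = fake_quantize_int4(with_outlier)

# Compare reconstruction error on the seven values the two tensors share.
err_clean = mean_squared_error(clean[:7], deq_clean[:7])
err_outlier = mean_squared_error(with_outlier[:7], deq_outlier[:7])
```

Because a single scale must cover the outlier, the step size grows until every ordinary value rounds to zero, so `err_outlier` is far larger than `err_clean`.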

Taxonomy

Topics: Speech Recognition and Synthesis · COVID-19 diagnosis using AI · Radiomics and Machine Learning in Medical Imaging