An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits

Cody Steinmetz; Gavin Childress; Aaron Herbst; Gavin Jones; Jasdeep Singh; Eli Vang; and Keagan Weinstock

arXiv:2505.08823·cs.LG·May 15, 2025

TL;DR

This paper shows that inserting RMS normalization before each linear layer enables stable, high-quality ternary (1.58-bit) quantization of large language models, reducing their memory and compute requirements without an elaborate training pipeline such as knowledge distillation.

Contribution

The authors show that inserting RMS normalization before every linear projection, combined with a gradual, layer-wise quantization schedule, is enough to fine-tune full-precision checkpoints into ternary LLMs stably.
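The idea of normalizing the input before a ternary linear projection can be sketched in a few lines of NumPy. This is a minimal forward-pass illustration, not the authors' implementation: the function names are hypothetical, the RMSNorm has no learned gain, and the absmean scaling rule for the ternary weights is an assumed choice that the summary above does not specify.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Rescale each row of x to (approximately) unit root-mean-square."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms

def ternarize(w, eps=1e-8):
    """Assumed absmean scheme: map weights to codes in {-1, 0, +1} plus one scale."""
    scale = np.mean(np.abs(w)) + eps
    codes = np.clip(np.round(w / scale), -1, 1)
    return codes, scale

def normed_ternary_linear(x, w):
    """Bias-free linear layer: RMS-normalize the input, then apply ternary weights."""
    codes, scale = ternarize(w)
    return rms_norm(x) @ (codes * scale).T

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)) * 50.0   # deliberately large activations
w = rng.normal(size=(8, 16))          # (out_features, in_features)
codes, scale = ternarize(w)
y = normed_ternary_linear(x, w)
```

Because the input is normalized first, the scale of `y` does not depend on how large the incoming activations are, which is one plausible reading of why the extra normalization helps stability.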

Findings

01. Achieves 1.58-bit precision with stable fine-tuning.

02. Matches or surpasses knowledge-distillation methods on standard language-modeling benchmarks.

03. Enables practical ultra-low-bit inference for LLMs.
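The "1.58-bit" figure is the information content of a single ternary weight, that is, a symbol drawn from three possible values:

```latex
\[
  w \in \{-1,\, 0,\, +1\}
  \quad\Longrightarrow\quad
  \log_2 3 \approx 1.585 \ \text{bits per weight}
\]
```

This is a theoretical minimum. A straightforward encoding would spend 2 bits on each ternary weight, which may be why the ternary regime is also described as "2-bit".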

Abstract

Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (2-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding…
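A gradual, layer-wise quantization schedule could look like the sketch below. The details are assumptions, since the abstract does not give the schedule: here each layer gets a blending factor `lam` that ramps linearly from 0 (full precision) to 1 (fully ternary), with later layers starting their ramp later. The names `layer_lambda` and `blended_weight` are hypothetical, and the absmean ternary rule is likewise an assumed choice.

```python
import numpy as np

def layer_lambda(step, layer, num_layers, total_steps):
    """Hypothetical staggered ramp: layer k starts at k/num_layers of training
    and reaches full quantization one window (total_steps/num_layers) later."""
    window = total_steps / num_layers
    start = layer * window
    return float(np.clip((step - start) / window, 0.0, 1.0))

def blended_weight(w, lam, eps=1e-8):
    """Interpolate between full-precision and ternary weights.
    lam = 0 returns w unchanged; lam = 1 returns the ternary weights."""
    scale = np.mean(np.abs(w)) + eps
    w_q = np.clip(np.round(w / scale), -1, 1) * scale
    return (1.0 - lam) * w + lam * w_q

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
lam_early = layer_lambda(step=30, layer=1, num_layers=4, total_steps=100)
w_mid = blended_weight(w, lam_early)
```

This shows only the forward computation. In actual training, the rounding step has zero gradient almost everywhere, so a straight-through estimator (as mentioned in the abstract) would pass gradients through it as if it were the identity.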

Taxonomy

Topics: Advanced MEMS and NEMS Technologies

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Absolute Position Encodings · Residual Connection