Finer is Better (with the Right Scaling)
Clemens Schaefer, Gil Tabak

TL;DR
This paper investigates the paradox in which finer quantization block sizes degrade LLM quality under standard abs-max scaling. It shows that the degradation is not inherent to finer granularity and can be removed with proper scaling and targeted algorithmic interventions.
Contribution
The study identifies the cause of the block size paradox and proposes algorithmic solutions that let standard quantization formats match or outperform custom formats.
Findings
Proper scaling prevents underflow and reduces localized errors.
Algorithmic interventions like 4-over-6 improve quantization geometry.
Finer block sizes with the right methods strictly reduce mean squared error.
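To make the findings concrete, here is a minimal pure-Python sketch of block-wise abs-max quantization onto the FP4 (E2M1) magnitude grid. It is an illustration under stated assumptions, not the paper's implementation: scales are kept as ideal real numbers (no low-precision scale storage), ties round toward the smaller grid value, and all function names are hypothetical. The helper `quantize_block_4_over_6` encodes one possible reading of the 4-over-6 idea (map the block maximum to either 6 or 4 and keep whichever gives lower error); the summary above does not spell out the procedure, so treat that helper as an assumption.

```python
# Representable magnitudes of the FP4 E2M1 element format.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(v):
    """Round v to the nearest FP4 magnitude, keeping its sign."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(v) - g))
    return -mag if v < 0 else mag


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def quantize_block(block, target_max=6.0):
    """Abs-max scaling: map the block's largest magnitude to target_max."""
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return list(block)  # an all-zero block is already exact
    scale = absmax / target_max  # ideal real-valued scale (assumption)
    return [round_to_fp4(v / scale) * scale for v in block]


def quantize_block_4_over_6(block):
    """Assumed reading of 4-over-6: try target maxima 6 and 4, keep the better."""
    candidates = [quantize_block(block, t) for t in (6.0, 4.0)]
    return min(candidates, key=lambda q: mse(block, q))


def quantize_tensor(values, block_size, quantizer=quantize_block):
    """Quantize a flat list block by block and return dequantized values."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantizer(values[i:i + block_size]))
    return out


if __name__ == "__main__":
    heavy_tailed = [0.1, 0.2, 0.1, 100.0]  # one outlier dominates the scale
    for bs in (4, 2):
        err = mse(heavy_tailed, quantize_tensor(heavy_tailed, bs))
        print(f"block_size={bs}: mse={err:.6f}")
```

In this toy setting the outlier forces the small values in its block to zero, so halving the block size limits the damage to fewer elements. The sketch deliberately omits scale quantization, so it does not reproduce the paradox itself.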
Abstract
Microscaling is a critical technique for preserving the quality of Large Language Models (LLMs) quantized to ultra-low precision formats. Intuitively, finer block sizes should yield lower quantization error; however, a paradox recently identified in the literature demonstrates that standard abs-max scaling can actually degrade model quality as block sizes shrink. In this work, we investigate the underlying mechanics of this phenomenon. We demonstrate that this degradation is not an inherent limitation of finer granularity, but is primarily driven by heavy-tailed tensor distributions interacting poorly with the coarse upper quantization bins of the FP4 element format. Specifically, we show that i) preventing the scaling factor from underflowing to zero mitigates localized errors, ii) targeted algorithmic interventions like the 4-over-6 methodology effectively correct the quantization…
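The scale-underflow point (i) can be illustrated with a toy model. The sketch below assumes the per-block scale is stored in a format whose smallest positive value is 2^-9 (the smallest FP8 E4M3 subnormal, chosen here purely for illustration), ignores mantissa rounding of the scale, and uses hypothetical function names. It is not the paper's method, only a demonstration of why a scale that flushes to zero wipes out a whole block.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
MIN_SCALE = 2.0 ** -9  # assumed smallest positive storable scale


def round_to_fp4(v):
    """Round v to the nearest FP4 magnitude (saturating at 6), keeping sign."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(v) - g))
    return -mag if v < 0 else mag


def store_scale(scale, guard):
    """Toy scale storage: tiny scales flush to zero unless guarded."""
    if scale < MIN_SCALE / 2:
        return MIN_SCALE if guard else 0.0
    return max(scale, MIN_SCALE)  # mantissa rounding is not modeled


def quantize_with_stored_scale(block, guard):
    """Abs-max quantize a block using the stored (possibly flushed) scale."""
    absmax = max(abs(v) for v in block)
    scale = store_scale(absmax / 6.0, guard)
    if scale == 0.0:
        return [0.0] * len(block)  # a zero scale erases the whole block
    return [round_to_fp4(v / scale) * scale for v in block]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    tiny_block = [0.004, 0.002, 0.001]
    for guard in (False, True):
        q = quantize_with_stored_scale(tiny_block, guard)
        print(f"guard={guard}: mse={mse(tiny_block, q):.3e}")
```

Finer blocks make small abs-max values more likely, since fewer blocks contain a large outlier, so in this simplified model an unguarded scale is more likely to flush to zero as the block size shrinks.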
