Is Finer Better? The Limits of Microscaling Formats in Large Language Models
Andrea Fasoli, Monodeep Kar, Chi-Chun Liu, Swagath Venkataramani, Viji Srinivasan, Leland Chang, Naigang Wang

TL;DR
This paper investigates the limitations of microscaling tensor quantization in large language models, revealing a surprising degradation in model output quality at small block sizes and proposing a new hardware-friendly scale format to mitigate this issue.
Contribution
The study uncovers the counterintuitive behavior of microscaling quantization at small block sizes and introduces a novel FP8 scale format, FP8 unsigned E5M3, to improve hardware efficiency and model performance.
Findings
Quantization error increases as block size decreases below a threshold.
Theoretical framework explains the interplay between tensor distributions and quantization limits.
FP8 unsigned E5M3 scales match the performance of FP8 E4M3 scales without requiring global scaling.
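To make the last finding more concrete, the following sketch compares the representable range of two unsigned 8-bit floating-point layouts, one with 4 exponent bits and one with 5, both with 3 mantissa bits. It is only an illustration under stated assumptions: the biases (7 and 15), the IEEE-style subnormals, and the absence of reserved NaN/Inf codes are choices made here, not details taken from the paper. The helper names `decode_unsigned_fp` and `dynamic_range` are hypothetical.

```python
def decode_unsigned_fp(code, man_bits, bias):
    """Decode an unsigned minifloat bit pattern.

    Assumes IEEE-style subnormals and no codes reserved for NaN/Inf.
    """
    exponent = code >> man_bits
    mantissa = code & ((1 << man_bits) - 1)
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:
        # Subnormal: no implicit leading one.
        return fraction * 2.0 ** (1 - bias)
    return (1 + fraction) * 2.0 ** (exponent - bias)


def dynamic_range(exp_bits, man_bits, bias):
    """Return (smallest positive value, largest value) over all bit patterns."""
    n_codes = 1 << (exp_bits + man_bits)
    values = [decode_unsigned_fp(c, man_bits, bias) for c in range(n_codes)]
    positive = [v for v in values if v > 0]
    return min(positive), max(positive)


# Assumed biases follow the common 2**(exp_bits - 1) - 1 convention.
lo4, hi4 = dynamic_range(exp_bits=4, man_bits=3, bias=7)
lo5, hi5 = dynamic_range(exp_bits=5, man_bits=3, bias=15)
print("E4M3 layout:", lo4, hi4, "max/min ratio:", hi4 / lo4)
print("E5M3 layout:", lo5, hi5, "max/min ratio:", hi5 / lo5)
```

Under these assumptions the extra exponent bit widens the ratio between the largest and smallest positive scale considerably. Real formats may reserve some codes (the OCP FP8 E4M3 format, for example, reserves a NaN encoding, which caps its maximum at 448), so the exact numbers differ from this idealized sketch.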
Abstract
Microscaling data formats leverage per-block tensor quantization to enable aggressive model compression with limited loss in accuracy. Unlocking their potential for efficient training and inference necessitates hardware-friendly implementations that handle matrix multiplications in a native format and adopt efficient error-mitigation strategies. Herein, we report the emergence of a surprising behavior associated with microscaling quantization, whereby the output of a quantized model degrades as block size is decreased below a given threshold. This behavior clashes with the expectation that a smaller block size should allow for a better representation of the tensor elements. We investigate this phenomenon both experimentally and theoretically, decoupling the sources of quantization error behind it. Experimentally, we analyze the distributions of several Large Language Models and identify…
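As a rough illustration of the per-block quantization the abstract refers to, the sketch below fake-quantizes a list of values to the FP4 E2M1 grid, one block at a time, and reports the mean squared error for a few block sizes. It is a minimal sketch, not the authors' implementation: the scale here is an ideal, unquantized float (absmax divided by the largest FP4 magnitude), the input is synthetic Gaussian data with an arbitrarily chosen standard deviation, and the function names are hypothetical. Because the scale is not quantized, this sketch is not expected to reproduce the degradation the paper reports at small block sizes; the reviews indicate that quantization of the scales plays a role in that effect.

```python
import random

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def round_to_fp4(x):
    """Round x to the nearest signed FP4 E2M1 value, saturating at +/-6."""
    mag = min(abs(x), FP4_MAX)
    nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
    return nearest if x >= 0 else -nearest


def quantize_blockwise(values, block_size):
    """Fake-quantize values block by block using an ideal (unquantized) scale."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # Assumption: the scale maps the block's absmax onto the largest FP4 value.
        scale = absmax / FP4_MAX if absmax > 0 else 1.0
        out.extend(round_to_fp4(v / scale) * scale for v in block)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in data
for block_size in (128, 32, 8):
    error = mse(weights, quantize_blockwise(weights, block_size))
    print(f"block size {block_size:4d}: MSE = {error:.3e}")
```

Replacing the ideal `scale` with a value rounded to a low-precision scale format would be the natural next step for exploring the behavior described in the abstract.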
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper makes a highly original and significant contribution by identifying "perplexity inversion," a counter-intuitive phenomenon where smaller block sizes unexpectedly increase quantization error in microscaling for LLMs. This challenges common assumptions and reveals a critical pitfall for future low-bit quantization efforts.
2. The FP8 UE5M3 solution stands out as a key strength due to its practical and well-reasoned approach to mitigating the identified perplexity inversion. By repurpo
1. The paper effectively shows perplexity inversion with FP4 elements and FP8 UE4M3 scales. Do the authors observe similar inversion with other low-bit formats (e.g., INT4, INT8, other FP formats) and quantized scales? Clarifying if this mechanism is universally applicable or specific to the studied configuration would define the discovery's scope.
2. The paper emphasizes the hardware-friendly nature of UE5M3, particularly for inference. However, the practical implications of integrating UE5M3 d
I appreciate the solid empirical observation and thorough investigation of a subtle but important anomaly in microscaling quantization. The paper formulates a clear theoretical framework that generalizes the understanding of error behavior and matches experiments well. The analysis in figures such as Fig 2b and Fig 3c is especially compelling as it isolates the dependence on distribution width and scale quantization. The proposed UE5M3 solution is simple, hardware friendly, and demonstrate
The anomaly is a surprising phenomenon for readers, and it may help to offer a concise intuitive explanation earlier in the introduction, rather than waiting until later sections, so that readers understand the high-level mechanism before diving into the detailed framework. For example, a short statement that quantization of scales interacts with narrow distributions and reduces effective representable range could improve clarity. It would also be valuable to expand the discussion to other scale
Identifies the counter-intuitive "finer is worse" quantization anomaly. Develops a mathematical framework that perfectly explains the why behind the anomaly, which is a significant step beyond just observing it.
The theory is heavily based on weight distributions (modeled as Normal), with less focus on how the anomaly impacts different and often asymmetric activation distributions. The claim of "minimal" hardware cost for UE5M3 is asserted but not analyzed in depth (e.g., no area or latency estimates).
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Tensor decomposition and applications · Machine Learning in Materials Science
