Search Your Block Floating Point Scales!
Tanmaey Gupta, Hayden Prairie, Xiaoxia Wu, Reyna Abhyankar, Qingyang Wu, Austin Silveria, Pragaash Ponnusamy, Jue Wang, Ben Athiwaratkun, Leon Song, Tri Dao, Daniel Y. Fu, and Chris De Sa

TL;DR
This paper introduces ScaleSearch, a method that searches for block floating point scale factors that minimize quantization error, and shows that it improves low-precision inference for language models.
Contribution
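It proposes a scale selection strategy for block floating point formats that replaces the fixed, maximum-magnitude scale with a fine-grained search, and integrates it with existing quantization methods such as post-training quantization and low-precision attention, improving their performance.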
Findings
ScaleSearch reduces quantization error by 27% for NVFP4.
It improves language model post-training quantization (PTQ) accuracy by up to 15 points on MATH500.
ScaleSearchAttention improves (lowers) Wikitext-2 perplexity by up to 0.77 points for Llama 3.1 70B.
Abstract
Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post Training Quantization and low-precision attention, and is shown to improve their performance. Additionally, we…
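The contrast between a max-magnitude scale and a searched scale can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes FP4 E2M1 element values, squared error as the objective, and a hypothetical candidate set formed by multiplying the max-based scale by a small grid of factors, rather than enumerating the scales that the format's mantissa bits can actually represent. The names `quantize_block`, `max_scale`, and `search_scale` are illustrative.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1 (assumed element format for this sketch).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_block(x, scale):
    """Round |x| / scale to the nearest FP4 magnitude, then rescale with sign."""
    if scale == 0.0:
        return np.zeros_like(x)
    y = np.abs(x) / scale
    # Nearest grid point per element; values above 6 saturate to 6.
    idx = np.abs(y[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale


def max_scale(x):
    """Baseline: map the largest magnitude in the block to the largest FP4 value."""
    return np.abs(x).max() / FP4_GRID[-1]


def search_scale(x, factors=np.linspace(0.5, 1.5, 41)):
    """Pick the candidate scale with the lowest squared error (hypothetical grid)."""
    base = max_scale(x)
    # Include the baseline itself so the search can never do worse than it.
    candidates = np.append(base * factors, base)
    errors = [np.sum((x - quantize_block(x, s)) ** 2) for s in candidates]
    return candidates[int(np.argmin(errors))]


def block_error(x, scale):
    """Squared quantization error of one block under a given scale."""
    return float(np.sum((x - quantize_block(x, scale)) ** 2))
```

For a block `x` (for example, 16 values drawn from a normal distribution), `block_error(x, search_scale(x))` is never larger than `block_error(x, max_scale(x))`, because the baseline scale is one of the candidates. How much the error drops depends on the data and on the candidate set, so this sketch says nothing about the figures reported above.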
