BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference
Wonsuk Jang, Thierry Tambe

TL;DR
BlockDialect introduces a block-wise mixed format quantization method for large language models, improving accuracy and energy efficiency by adaptively selecting optimal data formats per block and employing a specialized FP4 formatbook.
Contribution
It proposes a novel block-wise mixed format quantization technique with DialectFP4, enabling better data representation and energy-efficient LLM inference.
Findings
Achieves up to a 10.78% accuracy gain on LLaMA3-8B over the MXFP4 format.
Uses fewer bits per data element than previous formats.
Stays close to full-precision accuracy while consuming less energy.
Abstract
The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers…
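The per-block format selection described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `FORMATBOOK`, `quantize_block`, `pick_format`, and `quantize_blockwise` are hypothetical, only the `e2m1` entry reflects a standard format (FP4 E2M1 magnitudes expressed in units of 0.5) while the other two level sets are invented, the per-block scale is a plain float rather than a hardware shared exponent, and the search is a brute-force minimum-squared-error scan rather than the paper's two-stage online approach.

```python
import math

# Hypothetical formatbook: each entry lists 8 non-negative integer magnitudes
# (3 magnitude bits + 1 sign bit = 4 bits per element).
# "e2m1" is FP4 E2M1 in units of 0.5; the other two are invented for this sketch.
FORMATBOOK = {
    "e2m1": [0, 1, 2, 3, 4, 6, 8, 12],
    "uniform": [0, 2, 4, 6, 8, 10, 12, 14],
    "fine_low": [0, 1, 2, 3, 4, 5, 6, 8],
}


def quantize_block(block, levels):
    """Quantize one block to scaled integer levels and return dequantized values."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0.0] * len(block)
    # Assumption: a float scale mapping the block peak onto the largest level.
    scale = peak / max(levels)
    out = []
    for x in block:
        # Nearest representable magnitude, with the sign restored afterwards.
        mag = min(levels, key=lambda q: abs(abs(x) / scale - q))
        out.append(math.copysign(mag * scale, x))
    return out


def pick_format(block):
    """Return (name, squared_error, dequantized) for the best format of a block."""
    best = None
    for name, levels in FORMATBOOK.items():
        deq = quantize_block(block, levels)
        err = sum((a - b) ** 2 for a, b in zip(block, deq))
        if best is None or err < best[1]:
            best = (name, err, deq)
    return best


def quantize_blockwise(values, block_size=32):
    """Split values into blocks and choose a format independently for each block."""
    results = []
    for start in range(0, len(values), block_size):
        name, _, deq = pick_format(values[start:start + block_size])
        results.append((name, deq))
    return results


if __name__ == "__main__":
    evenly_spaced = [float(k) for k in range(8)]
    e2m1_shaped = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    for name, deq in quantize_blockwise(evenly_spaced + e2m1_shaped, block_size=8):
        print(name, deq)
```

In this toy setup, a block whose values are evenly spaced is represented best by the evenly spaced level set, while a block shaped like the E2M1 grid is represented best by `e2m1`. This is the intuition behind assigning a different format to each block instead of one format to the whole tensor.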
Taxonomy
Topics: Speech Recognition and Synthesis · Algorithms and Data Compression
