BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

Wonsuk Jang; Thierry Tambe

arXiv:2501.01144 · cs.CL · July 25, 2025

TL;DR

BlockDialect introduces a block-wise mixed format quantization method for large language models, improving accuracy and energy efficiency by adaptively selecting optimal data formats per block and employing a specialized FP4 formatbook.

Contribution

It proposes a novel block-wise mixed format quantization technique with DialectFP4, enabling better data representation and energy-efficient LLM inference.

Findings

1. Achieves up to 10.78% accuracy gain on LLaMA3-8B.
2. Uses fewer bits per data element than previous formats.
3. Maintains close to full-precision accuracy with lower energy consumption.

Abstract

The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers…
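The per-block format selection described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the formatbook contents (only the `fp4_e2m1` entry lists the standard FP4 E2M1 magnitudes; `int4_like` and `dense_small` are hypothetical), the block size of 32, the max-based per-block scale, and the minimum-MSE selection rule. It is not the DialectFP4 formatbook, and it does not reproduce the paper's two-stage online selection.

```python
# Hypothetical formatbook: each format is a sorted list of non-negative
# representable magnitudes; the sign is handled separately.
# Only "fp4_e2m1" is a standard format; the other two are made up here.
FORMATBOOK = {
    "fp4_e2m1": [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
    "int4_like": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "dense_small": [0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0],
}


def quantize_block(block, levels):
    """Quantize then dequantize one block with a shared scale.

    The scale maps the block's largest magnitude onto the largest level;
    each value is then rounded to the nearest representable magnitude.
    """
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return list(block)  # all-zero block is exactly representable
    scale = peak / levels[-1]
    out = []
    for v in block:
        mag = min(levels, key=lambda q: abs(abs(v) / scale - q))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_format(block, formatbook=FORMATBOOK):
    """Return (format_name, dequantized_block) with the lowest MSE.

    Minimum-MSE search is an assumption of this sketch, not the
    selection rule used in the paper.
    """
    best_name, best_q, best_err = None, None, float("inf")
    for name, levels in formatbook.items():
        q = quantize_block(block, levels)
        err = mse(block, q)
        if err < best_err:
            best_name, best_q, best_err = name, q, err
    return best_name, best_q


def quantize_tensor(values, block_size=32):
    """Split a flat list into blocks and pick one format per block."""
    names, out = [], []
    for start in range(0, len(values), block_size):
        name, q = select_format(values[start:start + block_size])
        names.append(name)
        out.extend(q)
    return names, out
```

Calling `quantize_tensor(values)` returns one format name per block alongside the dequantized values, so blocks with different distributions can end up with different formats. An exhaustive error search like this is costly for activations at inference time, which is presumably the motivation for the cheaper two-stage approach the abstract mentions.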

Code & Models

Repositories

https://code.stanford.edu/tambe-lab/blockdialect (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Algorithms and Data Compression