LO-BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference
Reena Elangovan, Charbel Sakr, Anand Raghunathan, Brucek Khailany

TL;DR
LO-BCQ introduces a novel block clustered quantization method for 4-bit weight and activation inference in large language models, significantly reducing accuracy loss without additional training.
Contribution
The paper proposes LO-BCQ, a new PTQ algorithm that clusters tensor blocks and designs optimal codebooks, enabling effective 4-bit quantization of LLMs without retraining.
Findings
Achieves less than 1% accuracy loss on several LLMs.
Advances state-of-the-art in 4-bit quantization for LLM inference.
Demonstrates effective quantization of both weights and activations.
Abstract
Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-8-bits while maintaining activations at 8-bits or higher. Accurate sub-8-bit quantization for both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ) wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps…
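The three ingredients the abstract names (contiguous blocks, clustering of blocks by their statistics, and one codebook per cluster) can be illustrated with a small sketch. This is not the authors' LO-BCQ algorithm, whose iteration steps are cut off in the text above. The choices below are assumptions made only for illustration: the block statistic is the maximum absolute value, blocks are clustered by equal-population ranking on that statistic, and each codebook is fitted with plain 1-D Lloyd (k-means) iterations. All function names are hypothetical. A 4-bit code corresponds to 16 levels per codebook.

```python
import random


def split_into_blocks(values, block_size):
    """Split a flat list into contiguous blocks of block_size scalars."""
    if len(values) % block_size != 0:
        raise ValueError("length must be a multiple of block_size")
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def nearest_index(levels, x):
    """Index of the codebook level closest to x."""
    return min(range(len(levels)), key=lambda j: abs(levels[j] - x))


def lloyd_codebook(samples, num_levels, iters=20):
    """Fit a scalar codebook with 1-D Lloyd iterations (assumed fitting method)."""
    lo, hi = min(samples), max(samples)
    step = (hi - lo) / max(num_levels - 1, 1)
    levels = [lo + step * j for j in range(num_levels)]  # uniform initialisation
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for x in samples:
            buckets[nearest_index(levels, x)].append(x)
        # Move each level to the mean of its bucket; keep empty levels in place.
        levels = [sum(b) / len(b) if b else levels[j] for j, b in enumerate(buckets)]
    return sorted(levels)


def cluster_blocks(blocks, num_clusters):
    """Cluster blocks by ranking on max |x| (assumed statistic), equal population."""
    if len(blocks) < num_clusters:
        raise ValueError("need at least one block per cluster")
    order = sorted(range(len(blocks)), key=lambda i: max(abs(x) for x in blocks[i]))
    assignment = [0] * len(blocks)
    for rank, i in enumerate(order):
        assignment[i] = rank * num_clusters // len(blocks)
    return assignment


def bcq_quantize(values, block_size=8, num_clusters=4, num_levels=16):
    """Quantize values block by block using one codebook per block cluster."""
    blocks = split_into_blocks(values, block_size)
    assignment = cluster_blocks(blocks, num_clusters)
    codebooks = []
    for c in range(num_clusters):
        samples = [x for b, a in zip(blocks, assignment) if a == c for x in b]
        codebooks.append(lloyd_codebook(samples, num_levels))
    dequantized = []
    for b, a in zip(blocks, assignment):
        cb = codebooks[a]
        dequantized.extend(cb[nearest_index(cb, x)] for x in b)
    return dequantized, assignment, codebooks


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data whose spread grows along the tensor, so blocks differ in scale.
    data = [random.gauss(0.0, 1.0 + i // 64) for i in range(256)]
    deq, assignment, codebooks = bcq_quantize(data)
    mse = sum((x - q) ** 2 for x, q in zip(data, deq)) / len(data)
    print("blocks:", len(assignment), "mean squared error:", mse)
```

In this sketch each scalar is stored as an index into a 16-entry codebook, and each block additionally carries the index of its cluster so the matching codebook can be selected at dequantization time.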
Taxonomy
Topics: Brain Tumor Detection and Classification · Image and Signal Denoising Methods · Medical Imaging Techniques and Applications
