DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization
Yeonhong Park, Jake Hyun, Hojoon Kim, Jae W. Lee

TL;DR
DecDEC is an inference scheme for low-bit quantized large language models that fetches residuals for the salient channels and uses them to correct quantization error, improving model quality with little added GPU memory or latency.
Contribution
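DecDEC keeps the residual matrix (full-precision minus quantized weights) in CPU memory and dynamically fetches residuals for only the salient channels, improving low-bit LLM quality while preserving the GPU memory savings and latency reduction of quantization.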
Findings
Reduces perplexity of 3-bit Llama-3-8B-Instruct from 10.15 to 9.12
Adds less than 0.0003% to GPU memory usage
Increases inference latency by only 1.7% on NVIDIA RTX 4050 Mobile
Abstract
Quantization of Large Language Models (LLMs) has recently gained popularity, particularly for on-device settings with limited hardware resources. While efficient, quantization inevitably degrades model quality, especially in aggressive low-bit settings such as 3-bit and 4-bit precision. In this paper, we propose DecDEC, an inference scheme that improves the quality of low-bit LLMs while preserving the key benefits of quantization: GPU memory savings and latency reduction. DecDEC stores the residual matrix -- the difference between full-precision and quantized weights -- in CPU, and dynamically fetches the residuals for only a small portion of the weights. This portion corresponds to the salient channels, marked by activation outliers, with the fetched residuals helping to correct quantization errors in these channels. Salient channels are identified dynamically at each decoding step by…
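The residual-correction idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the round-to-nearest quantizer, the exact top-k selection by absolute activation, the unquantized residual, and the names `quantize_rtn` and `corrected_matvec` are all assumptions made for illustration (the abstract text above is cut off before it states how salient channels are actually identified). Both the weights and the residual are ordinary arrays here, so no real CPU-to-GPU transfer takes place.

```python
import numpy as np

def quantize_rtn(w, bits=3):
    """Symmetric round-to-nearest quantization with one scale per output column.
    A stand-in quantizer chosen for illustration, not the one used in the paper."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized weights

def corrected_matvec(x, w_q, residual, k):
    """Compute x @ w_q, then add the residual contribution of the k input
    channels with the largest |activation| (assumed selection rule)."""
    salient = np.argsort(np.abs(x))[-k:]        # indices of the k largest |x|
    y = x @ w_q                                  # low-bit result
    y += x[salient] @ residual[salient, :]       # only k residual rows are read
    return y

rng = np.random.default_rng(0)
d_in, d_out, k = 64, 32, 4
w = rng.standard_normal((d_in, d_out))
x = rng.standard_normal(d_in)
x[:k] += 20.0                                    # inject activation outliers

w_q = quantize_rtn(w, bits=3)
residual = w - w_q                               # would live in CPU memory

y_ref = x @ w                                    # full-precision reference
err_base = np.linalg.norm(x @ w_q - y_ref)
err_corr = np.linalg.norm(corrected_matvec(x, w_q, residual, k) - y_ref)
print(f"error without correction: {err_base:.3f}")
print(f"error with {k} corrected channels: {err_corr:.3f}")
```

Because the output error is a sum of per-channel terms weighted by the activations, channels with outlier activations dominate it, which is why correcting only a few rows of the residual can remove most of the error in this toy setting.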
Taxonomy
Topics: VLSI and Analog Circuit Testing · Advancements in Photolithography Techniques · Advancements in Semiconductor Devices and Circuit Design
