PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization
Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo

TL;DR
PrefixQuant is a training-free, efficient quantization method for LLMs that effectively isolates token-wise outliers, leading to significant accuracy improvements and speedups across various quantization settings.
Contribution
It introduces a novel token-wise outlier elimination technique using prefixing, combined with trainable parameters for error compensation, advancing LLM quantization accuracy.
Findings
Achieves state-of-the-art performance on multiple quantization levels.
Significantly outperforms existing dynamic quantization methods.
Provides notable speedups in prefill and decoding processes.
Abstract
Existing weight-activation quantization methods for Large Language Models (LLMs) primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers. First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache, a process that is training-free and highly efficient (e.g., 1 minutes for Llama-3-70B). Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error. Our experiments show that PrefixQuant significantly outperforms existing dynamic quantization methods, even under coarser static…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNeural Networks and Applications · Speech Recognition and Synthesis · Natural Language Processing Techniques
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
