PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization

Mengzhao Chen; Yi Liu; Jiahao Wang; Yi Bin; Wenqi Shao; Ping Luo

arXiv:2410.05265·cs.LG·January 28, 2025

TL;DR

PrefixQuant is a quantization method for LLMs that isolates token-wise outliers through a training-free, efficient prefixing step, leading to significant accuracy improvements and speedups across various quantization settings.

Contribution

It introduces a token-wise outlier elimination technique that prefixes outlier tokens in the KV cache, combined with new trainable parameters for block-wise training that compensate for quantization error.

Findings

1. Achieves state-of-the-art performance across precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization).
2. Significantly outperforms existing dynamic quantization methods.
3. Provides notable speedups in prefill and decoding.

Abstract

Existing weight-activation quantization methods for Large Language Models (LLMs) primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers. First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache, a process that is training-free and highly efficient (e.g., 1 minute for Llama-3-70B). Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error. Our experiments show that PrefixQuant significantly outperforms existing dynamic quantization methods, even under coarser static…
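To make the token-wise outlier problem concrete, here is a minimal NumPy sketch. It is not the PrefixQuant implementation, and every name in it (`quantize_dequantize`, `static_per_tensor_scale`, `dynamic_per_token_scale`, the synthetic `acts` tensor, the 100x outlier magnitude) is an assumption made for illustration. It only shows why a single outlier token inflates a static per-tensor activation scale, and how much the error on the remaining tokens drops once that token is kept out of the quantized activations, which is the effect the abstract attributes to isolating outlier tokens.

```python
import numpy as np

def quantize_dequantize(x, scale, bits=4):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def static_per_tensor_scale(x, bits=4):
    """One scale for the whole tensor (static, per-tensor granularity)."""
    qmax = 2 ** (bits - 1) - 1
    return np.abs(x).max() / qmax

def dynamic_per_token_scale(x, bits=4):
    """One scale per token (row), recomputed from the data itself."""
    qmax = 2 ** (bits - 1) - 1
    return np.abs(x).max(axis=-1, keepdims=True) / qmax

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Synthetic activations: 16 tokens x 64 channels (hypothetical data).
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=(16, 64))
acts[0] *= 100.0          # token 0 plays the role of a token-wise outlier
normal = acts[1:]         # the tokens whose accuracy we care about

# Static scale calibrated with the outlier token present.
scale_with = static_per_tensor_scale(acts)
err_with = mse(quantize_dequantize(normal, scale_with), normal)

# Static scale calibrated with the outlier token isolated (excluded).
scale_without = static_per_tensor_scale(normal)
err_without = mse(quantize_dequantize(normal, scale_without), normal)

# Dynamic per-token scales on the same normal tokens, for comparison.
scale_dyn = dynamic_per_token_scale(normal)
err_dyn = mse(quantize_dequantize(normal, scale_dyn), normal)

print(f"static, outlier present : {err_with:.4f}")
print(f"static, outlier isolated: {err_without:.4f}")
print(f"dynamic per-token       : {err_dyn:.4f}")
```

With the outlier present, the 4-bit step size is so large that nearly every ordinary activation rounds to zero. Calibrating on the remaining tokens alone gives a far smaller step size, which suggests why removing token-wise outliers can make coarse static quantization viable.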

Code & Models

Repositories

chenmnz/prefixquant (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings