BiSup: Bidirectional Quantization Error Suppression for Large Language Models
Minghui Zou, Ronghui Guo, Sai Zhang, Xiaowang Zhang, Zhiyong Feng

TL;DR
BiSup is a bidirectional quantization error suppression method for large language models. It limits how quantization error accumulates through layers and diffuses across tokens during weight-activation quantization, which improves the accuracy of efficient low-bit models.
Contribution
The paper proposes BiSup, a novel bidirectional error suppression technique that combines parameter-efficient fine-tuning and prompt mixed-precision strategies for better quantization of LLMs.
Findings
Significant perplexity reduction on Llama and Qwen models.
Outperforms state-of-the-art quantization methods.
Enhances practical deployment of low-bit LLMs.
Abstract
As the size and context length of Large Language Models (LLMs) grow, weight-activation quantization has emerged as a crucial technique for efficient deployment of LLMs. Compared to weight-only quantization, weight-activation quantization presents greater challenges due to the presence of outliers in activations. Existing methods have made significant progress by exploring mixed-precision quantization and outlier suppression. However, these methods primarily focus on optimizing the result of a single matrix multiplication, neglecting the bidirectional propagation of quantization errors in LLMs. Specifically, errors accumulate vertically within the same token through layers, and diffuse horizontally across different tokens due to self-attention mechanisms. To address this issue, we introduce BiSup, a Bidirectional quantization error Suppression method. By constructing appropriate…
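The two error directions named in the abstract can be illustrated with a toy model. The sketch below is not BiSup and does not reproduce any result from the paper. It assumes a plain symmetric uniform quantizer, random weights, a tanh nonlinearity, and single-head attention, and every function name (fake_quantize, layer_errors, attend, diffusion_error) is hypothetical. It only shows where the error comes from: layer_errors tracks one token through a stack of quantized layers (vertical), and diffusion_error quantizes the key and value of token 0 only and measures the resulting change in the attention output of token 1 (horizontal).

```python
import math
import random


def fake_quantize(values, bits=4):
    """Symmetric uniform quantize-then-dequantize with one scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def layer_errors(dim=16, n_layers=6, bits=4, seed=0):
    """Vertical direction: relative error of one token after each layer."""
    rng = random.Random(seed)
    x_fp = [rng.gauss(0, 1) for _ in range(dim)]
    x_q = list(x_fp)
    errors = []
    for _ in range(n_layers):
        w = [[rng.gauss(0, 1 / math.sqrt(dim)) for _ in range(dim)]
             for _ in range(dim)]
        w_q = [fake_quantize(row, bits) for row in w]  # one scale per output row
        x_fp = [math.tanh(v) for v in matvec(w, x_fp)]
        # The quantized path re-quantizes its own, already perturbed, input.
        x_q = [math.tanh(v) for v in matvec(w_q, fake_quantize(x_q, bits))]
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_fp, x_q)))
        errors.append(diff / math.sqrt(sum(a * a for a in x_fp)))
    return errors


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values))
            for i in range(d)]


def diffusion_error(dim=8, bits=4, seed=0):
    """Horizontal direction: quantize only token 0, measure error at token 1."""
    rng = random.Random(seed)
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
    query = [rng.gauss(0, 1) for _ in range(dim)]  # query of token 1
    exact = attend(query, keys, values)
    keys_q = [fake_quantize(keys[0], bits), keys[1]]
    values_q = [fake_quantize(values[0], bits), values[1]]
    noisy = attend(query, keys_q, values_q)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(exact, noisy)))


if __name__ == "__main__":
    print("per-layer relative error:", layer_errors())
    print("error at token 1 from quantizing token 0:", diffusion_error())
```

Running the script prints the per-layer relative error of the quantized path and the error induced at token 1. In this toy setting token 1 itself is never quantized, so any nonzero value in the second line comes entirely from attending to the quantized token 0.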
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Focus · LLaMA
