BiSup: Bidirectional Quantization Error Suppression for Large Language Models

Minghui Zou; Ronghui Guo; Sai Zhang; Xiaowang Zhang; Zhiyong Feng

arXiv:2405.15346·cs.CL·May 27, 2024

TL;DR

BiSup is a bidirectional quantization error suppression method for large language models. It reduces both the accumulation of error through layers and its diffusion across tokens during weight-activation quantization, improving the performance and efficiency of the quantized models.

Contribution

The paper proposes BiSup, a bidirectional error suppression technique that combines parameter-efficient fine-tuning with a prompt mixed-precision strategy to improve the quantization of LLMs.

Findings

01. Significant perplexity reduction on Llama and Qwen models.

02. Outperforms state-of-the-art quantization methods.

03. Enhances practical deployment of low-bit LLMs.

Abstract

As the size and context length of Large Language Models (LLMs) grow, weight-activation quantization has emerged as a crucial technique for efficient deployment of LLMs. Compared to weight-only quantization, weight-activation quantization presents greater challenges due to the presence of outliers in activations. Existing methods have made significant progress by exploring mixed-precision quantization and outlier suppression. However, these methods primarily focus on optimizing the results of single matrix multiplication, neglecting the bidirectional propagation of quantization errors in LLMs. Specifically, errors accumulate vertically within the same token through layers, and diffuse horizontally across different tokens due to self-attention mechanisms. To address this issue, we introduce BiSup, a Bidirectional quantization error Suppression method. By constructing appropriate…
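
The vertical accumulation of error described in the abstract can be illustrated with a toy experiment. The sketch below is not BiSup and is not taken from the paper. It assumes symmetric per-tensor round-to-nearest quantization and uses a stack of random linear layers as a stand-in for a model; the names `fake_quantize` and `layerwise_relative_error` are hypothetical. It tracks how far the quantized activations drift from the full-precision ones as depth increases.

```python
import numpy as np


def fake_quantize(x, bits=4):
    """Quantize then dequantize x (symmetric, per-tensor, round-to-nearest).

    Assumption for illustration only; real methods typically use finer
    granularity (per-channel or per-token scales).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()
    scale = max_abs / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale


def layerwise_relative_error(num_layers=8, dim=64, tokens=16, bits=4, seed=0):
    """Return the relative activation error after each layer.

    A full-precision path and a weight-activation quantized path share the
    same random weights; the quantized path feeds its own (already
    erroneous) output into the next layer, so errors can build up.
    """
    rng = np.random.default_rng(seed)
    x_fp = rng.standard_normal((tokens, dim))
    x_q = x_fp.copy()
    errors = []
    for _ in range(num_layers):
        # Scale weights so activation magnitude stays roughly constant.
        w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        x_fp = x_fp @ w
        # Quantize both operands of the matrix multiplication.
        x_q = fake_quantize(x_q, bits) @ fake_quantize(w, bits)
        errors.append(np.linalg.norm(x_q - x_fp) / np.linalg.norm(x_fp))
    return errors
```

Calling `layerwise_relative_error(bits=4)` and `layerwise_relative_error(bits=8)` and comparing the returned lists gives a rough sense of how the per-layer error depends on bit width. The horizontal diffusion across tokens through self-attention is not modeled here.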

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Focus · LLaMA