Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression

Peiyu Liu; Ze-Feng Gao; Wayne Xin Zhao; Yipeng Ma; Tao Wang; Ji-Rong Wen

arXiv:2405.12591·cs.CL·May 22, 2024

TL;DR

DecoQuant is a data-free low-bit quantization method using tensor decomposition to compress KV caches in LLMs, significantly reducing memory usage while preserving inference quality.

Contribution

It introduces a novel tensor decomposition-based quantization approach that effectively handles outliers, enabling efficient, data-free KV cache compression for large language models.

Findings

01. Achieves up to 75% memory reduction in KV cache.
02. Maintains comparable inference quality with significant compression.
03. Provides an efficient dequantization kernel tailored for DecoQuant.
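
As a rough sanity check on the 75% figure, the arithmetic below assumes a 16-bit baseline cache quantized to 4 bits. The bit widths and the model dimensions (32 layers, 32 heads, head dimension 128, 4096 tokens) are illustrative assumptions, not values taken from the paper. The sketch also ignores quantization scales and any tensors kept at higher precision, which is one reason such a saving is an upper bound.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bits):
    """Bytes needed to store keys and values for one sequence."""
    elements = 2 * layers * heads * head_dim * seq_len  # 2 = keys + values
    return elements * bits // 8

# Hypothetical dimensions, chosen only for illustration.
dims = dict(layers=32, heads=32, head_dim=128, seq_len=4096)
baseline = kv_cache_bytes(bits=16, **dims)    # assumed 16-bit baseline
compressed = kv_cache_bytes(bits=4, **dims)   # assumed 4-bit quantized cache
reduction = 1 - compressed / baseline

print(f"{baseline / 2**30:.2f} GiB -> {compressed / 2**30:.2f} GiB "
      f"({reduction:.0%} smaller)")
```

Under these assumptions the saving depends only on the ratio of bit widths (1 - 4/16), not on the model dimensions.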

Abstract

Key-value (KV) caching is an important technique for accelerating the inference of large language models (LLMs), but it incurs significant memory overhead. To compress the KV cache, existing methods often compromise precision or require extra data for calibration, limiting their practicality in LLM deployment. In this paper, we introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods, to effectively compress the KV cache. Our core idea is to adjust the outlier distribution of the original matrix by performing tensor decomposition, so that the quantization difficulties are migrated from the matrix to the decomposed local tensors. Specifically, we find that outliers mainly concentrate on small local tensors, while large tensors tend to have a narrower value range. Based on this finding, we propose to apply low-bit quantization to the…
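
The idea of moving outliers into a small factor can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it swaps in a plain thin SVD for the paper's tensor decomposition and uses a per-tensor round-to-nearest quantizer. The function names and the synthetic matrix with one outlier channel are all hypothetical. The large factor is quantized to 4 bits without calibration data, while the small factor stays in 16-bit floating point.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Per-tensor asymmetric round-to-nearest quantization (no calibration data)."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def dequantize_uniform(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return codes.astype(np.float32) * scale + lo

def decompose_and_quantize(m, bits=4):
    """Split m (n x d, n >= d) into a large and a small factor, quantize the large one.

    Assumption: a thin SVD stands in for the paper's tensor decomposition.
    """
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    # Small d x d factor: absorbs the singular values, so it carries the magnitudes.
    small = (s[:, None] * vt).astype(np.float16)
    # Large n x d factor: orthonormal columns, so every entry lies in [-1, 1].
    codes, scale, lo = quantize_uniform(u, bits)
    return codes, scale, lo, small

def reconstruct(codes, scale, lo, small):
    """Dequantize the large factor and multiply the factors back together."""
    return dequantize_uniform(codes, scale, lo) @ small.astype(np.float32)

# Synthetic stand-in for a cached key/value matrix: 1024 tokens, 64 channels.
rng = np.random.default_rng(0)
m = rng.standard_normal((1024, 64)).astype(np.float32)
m[:, 3] *= 50.0  # one artificial outlier channel

codes, scale, lo, small = decompose_and_quantize(m, bits=4)
m_hat = reconstruct(codes, scale, lo, small)
rel_err = float(np.linalg.norm(m - m_hat) / np.linalg.norm(m))
print(f"relative reconstruction error: {rel_err:.3f}")
```

In this toy setup the outlier magnitude ends up in the small 64 x 64 factor, while the 1024 x 64 factor has a bounded value range, which loosely mirrors the observation stated in the abstract. For simplicity the 4-bit codes are stored one per byte; a real kernel would pack two codes per byte.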

Code & Models

Repositories

lpyhdzx/DecoQuant_code (PyTorch, official)

Videos

Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression · Underline

Taxonomy

Topics: Advanced Data Compression Techniques · Error Correcting Code Techniques · Algorithms and Data Compression