QAQ: Quality Adaptive Quantization for LLM KV Cache

Shichen Dong; Wen Cheng; Jiayu Qin; Wei Wang

arXiv:2403.04643 · cs.CL · April 15, 2024 · 3 cites

TL;DR

QAQ is an adaptive quantization method for LLM KV caches that compresses the cache by up to 10x with negligible performance loss, enabling longer-context applications and more efficient deployment.

Contribution

The paper presents QAQ, a new quantization scheme that separately optimizes key and value caches, incorporating outlier handling and attention-awareness for superior compression.
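To make the idea of "quantization with outlier handling" concrete, here is a minimal sketch in plain Python. It is not the QAQ algorithm: it is a generic uniform quantizer that keeps the largest-magnitude entries in full precision, and the function names, the `bits` and `n_outliers` parameters, and the sample values are all assumptions chosen for illustration.

```python
def quantize_with_outliers(values, bits=4, n_outliers=1):
    """Uniformly quantize `values`, keeping the largest-magnitude entries exact.

    Hypothetical helper for illustration only; not taken from the paper.
    Requires n_outliers < len(values).
    """
    # Indices sorted by absolute magnitude, largest first.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outliers = {i: values[i] for i in order[:n_outliers]}

    # The quantization range is computed from the remaining (inlier) values only,
    # so a single large entry does not stretch the grid.
    inliers = [v for i, v in enumerate(values) if i not in outliers]
    lo, hi = min(inliers), max(inliers)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0

    # Outlier positions get a placeholder code of 0; their exact values live in `outliers`.
    codes = [0 if i in outliers else round((v - lo) / scale)
             for i, v in enumerate(values)]
    return codes, lo, scale, outliers


def dequantize(codes, lo, scale, outliers):
    """Reconstruct values: exact for outliers, grid-rounded for everything else."""
    return [outliers.get(i, lo + c * scale) for i, c in enumerate(codes)]


values = [0.1, -0.2, 0.05, 8.0, 0.15]  # 8.0 plays the role of an outlier
codes, lo, scale, outliers = quantize_with_outliers(values, bits=4, n_outliers=1)
restored = dequantize(codes, lo, scale, outliers)
```

Without the outlier split, the value 8.0 would set the quantization range and the small entries would collapse onto very few grid points. With it, each inlier is reconstructed to within half a quantization step.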

Findings

01. Achieves up to 10x compression ratio of KV cache size

02. Negligible impact on model performance

03. Enables longer-context LLM applications

Abstract

The emergence of LLMs has ignited a fresh surge of breakthroughs in NLP applications, particularly in domains such as question-answering systems and text generation. As the need for longer context grows, a significant bottleneck in model deployment emerges due to the linear expansion of the Key-Value (KV) cache with the context length. Existing methods primarily rely on various hypotheses, such as sorting the KV cache based on attention scores for replacement or eviction, to compress the KV cache and improve model throughput. However, heuristics used by these strategies may wrongly evict essential KV cache, which can significantly degrade model performance. In this paper, we propose QAQ, a Quality Adaptive Quantization scheme for the KV cache. We theoretically demonstrate that key cache and value cache exhibit distinct sensitivities to quantization, leading to the formulation of…
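The linear growth of the KV cache with context length can be checked with a back-of-the-envelope calculation. The sketch below uses the standard accounting of one key and one value tensor per layer. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit elements) is a hypothetical configuration chosen for illustration, not one reported in this listing.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Approximate KV cache size in bytes for a decoder-only transformer.

    The factor 2 accounts for storing both keys and values in every layer.
    """
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (an assumption, not from the paper).
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2)

size_4k = kv_cache_bytes(4096, **shape)
size_8k = kv_cache_bytes(8192, **shape)

print(size_4k / 2**30, "GiB at 4096 tokens")
print(size_8k / 2**30, "GiB at 8192 tokens")
```

Doubling the context length doubles the cache, which is why a compression ratio such as the "up to 10x" listed under Findings translates directly into longer usable context for the same memory budget.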

Code & Models

Repositories

clubiedong/kvcachequantization
PyTorch · Official

Taxonomy

Topics: Advanced Data Compression Techniques · Algorithms and Data Compression · Advanced Data Storage Technologies