KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

Tianyi Zhang; Jonah Yi; Zhaozhuo Xu; Anshumali Shrivastava

arXiv:2405.03917 · cs.LG · May 8, 2024 · 1 citation

TL;DR

This paper introduces Coupled Quantization (CQ), a novel method that exploits inter-channel dependencies to compress key/value caches in large language models, enabling 1-bit per channel inference with minimal quality loss.

Contribution

The paper proposes Coupled Quantization (CQ), a technique that quantizes groups of key/value channels jointly rather than one channel at a time. Exploiting the dependencies between channels lets it compress the KV cache of LLMs to bit widths as low as 1 bit per channel.

Findings

01. CQ outperforms existing methods in preserving model quality.

02. CQ enables a 1-bit-per-channel KV cache with minimal quality degradation.

03. Extensive experiments validate the effectiveness of CQ.
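To put finding 02 in perspective, the arithmetic below estimates KV cache size at 16 bits versus 1 bit per channel. It is a rough sketch: the function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, not figures from the paper, and the estimate ignores any overhead for storing quantization centroids or codebooks.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bits_per_channel):
    """Approximate KV cache size in bytes (hypothetical helper, codebook overhead ignored)."""
    # The factor 2 accounts for storing both keys and values.
    channels = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return channels * bits_per_channel // 8


# Assumed model shape and workload, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch_size=8)

fp16_bytes = kv_cache_bytes(**shape, bits_per_channel=16)
one_bit_bytes = kv_cache_bytes(**shape, bits_per_channel=1)

print(fp16_bytes / 2**30)     # 16.0 (GiB)
print(one_bit_bytes / 2**30)  # 1.0 (GiB)
```

Under these assumptions the cache shrinks by a factor of 16, which is why sub-2-bit quantization matters most at large batch sizes and long contexts.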

Abstract

Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments…
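The entropy observation and the coupling idea from the abstract can be illustrated on synthetic data. The sketch below is not the authors' implementation: it assumes two correlated Gaussian channels, couples them as a single pair, and learns the four joint centroids with plain k-means (Lloyd) iterations. All function names are hypothetical. Both quantizers spend the same budget of 1 bit per channel, that is, 2 bits per pair.

```python
import numpy as np


def entropy_bits(labels, n_symbols):
    """Empirical Shannon entropy, in bits, of integer labels in [0, n_symbols)."""
    p = np.bincount(labels, minlength=n_symbols) / labels.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def independent_1bit(x):
    """Quantize each channel on its own: 1 bit by sign, decoded to that side's mean."""
    bits = (x > 0).astype(np.int64)
    recon = np.empty_like(x)
    for c in range(x.shape[1]):
        for b in (0, 1):
            mask = bits[:, c] == b
            if mask.any():
                recon[mask, c] = x[mask, c].mean()
    return bits, recon


def coupled_1bit(x, init_centroids, n_iters=25):
    """Quantize a channel pair jointly with 4 centroids (2 bits per pair)."""
    centroids = init_centroids.copy()
    for _ in range(n_iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        codes = dist.argmin(1)
        for k in range(len(centroids)):
            mask = codes == k
            if mask.any():  # keep the old centroid if a cluster is empty
                centroids[k] = x[mask].mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    codes = dist.argmin(1)
    return codes, centroids[codes]


# Synthetic stand-in for two strongly dependent key/value channels (assumption).
rng = np.random.default_rng(0)
rho = 0.9
x = rng.multivariate_normal(np.zeros(2), np.array([[1.0, rho], [rho, 1.0]]), size=4000)

bits, indep_recon = independent_1bit(x)
# Start k-means from the per-channel grid so it can only match or improve on it.
codes, coupled_recon = coupled_1bit(x, np.unique(indep_recon, axis=0))

h0 = entropy_bits(bits[:, 0], 2)
h1 = entropy_bits(bits[:, 1], 2)
h_joint = entropy_bits(bits[:, 0] * 2 + bits[:, 1], 4)
indep_mse = float(((x - indep_recon) ** 2).mean())
coupled_mse = float(((x - coupled_recon) ** 2).mean())

print("sum of marginal entropies:", h0 + h1, "joint entropy:", h_joint)
print("independent MSE:", indep_mse, "coupled MSE:", coupled_mse)
```

Because the k-means iterations start from the per-channel solution and never increase distortion, the coupled reconstruction error in this sketch is at most the independent one. The gap widens as the channels become more dependent, which is the effect the abstract describes.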

Videos

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Algorithms and Data Compression · Topic Modeling