SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

Haojie Duanmu; Zhihang Yuan; Xiuhong Li; Jiangfei Duan; Xingcheng Zhang; Dahua Lin

arXiv:2405.06219 · cs.LG · November 13, 2024

TL;DR

The paper introduces SKVQ, a novel sliding-window KV cache quantization method that significantly reduces memory usage in large language models while preserving accuracy, enabling longer context processing and faster decoding.

Contribution

SKVQ is the first approach to effectively quantize KV caches to extremely low bitwidths with high compression and minimal accuracy loss.

Findings

1. Quantizes the KV cache to 2-bit keys and 1.5-bit values with minimal accuracy loss.

2. Enables context lengths of up to 1 million tokens on an 80GB GPU for a 7B model.

3. Provides up to 7x faster decoding.
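To put the memory finding in perspective, the following back-of-the-envelope sketch estimates KV cache size at 16-bit precision and at 2-bit keys with 1.5-bit values. The model dimensions (32 layers, hidden size 4096) are an assumption chosen to resemble a typical 7B decoder and are not taken from this page. The estimate also ignores quantization metadata (scales, zero points) and the full-precision window, so a real footprint would be somewhat higher.

```python
def kv_cache_bytes(tokens, layers, hidden, key_bits, value_bits):
    """Rough KV cache size in bytes: one key and one value vector
    of width `hidden` per layer per token (metadata not counted)."""
    bits_per_token = layers * hidden * (key_bits + value_bits)
    return tokens * bits_per_token / 8

GIB = 1024 ** 3

# Assumed 7B-like dimensions (hypothetical, for illustration only).
TOKENS, LAYERS, HIDDEN = 1_000_000, 32, 4096

fp16_bytes = kv_cache_bytes(TOKENS, LAYERS, HIDDEN, 16, 16)   # about 488 GiB
skvq_bytes = kv_cache_bytes(TOKENS, LAYERS, HIDDEN, 2, 1.5)   # about 53 GiB

print(f"FP16 cache:        {fp16_bytes / GIB:.1f} GiB")
print(f"2-bit K / 1.5-bit V: {skvq_bytes / GIB:.1f} GiB")
print(f"compression ratio: {fp16_bytes / skvq_bytes:.2f}x")
```

Under these assumptions, a 16-bit cache for one million tokens would far exceed 80GB, while the low-bitwidth cache leaves room for the model weights on the same device.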

Abstract

Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks such as book understanding and generating lengthy novels. However, the key-value (KV) cache required by LLMs consumes substantial memory as context length increases, becoming the bottleneck for deployment. In this paper, we present a strategy called SKVQ, which stands for sliding-window KV cache quantization, to address the issue of extremely low bitwidth KV cache quantization. To achieve this, SKVQ rearranges the channels of the KV cache to improve the similarity of channels within quantization groups, and applies clipped dynamic quantization at the group level. Additionally, SKVQ ensures that the tokens in the most recent window of the KV cache are preserved with high precision. This helps maintain the accuracy of a small but important portion of the KV cache. SKVQ achieves high compression…
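The three ingredients named in the abstract (channel reordering, clipped group-level dynamic quantization, and a full-precision recent window) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the clipping formula, and the choice to reorder channels by sorting on peak magnitude are all hypothetical stand-ins, and the sketch returns dequantized values rather than packed low-bit storage.

```python
import numpy as np

def fake_quant(x, bits, clip):
    """Quantize then dequantize along the last axis (one group per row).
    The min/max range is shrunk symmetrically by `clip` (an assumed scheme)."""
    mx = x.max(axis=-1, keepdims=True)
    mn = x.min(axis=-1, keepdims=True)
    half = (mx - mn) / 2 * clip            # clipped half-range
    lo = (mx + mn) / 2 - half              # lower edge of the clipped range
    levels = 2 ** bits - 1
    scale = np.maximum(2 * half / levels, 1e-8)
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def sliding_window_quant(cache, perm, group_size=8, bits=2, clip=0.95, window=4):
    """cache: (tokens, channels). The last `window` tokens stay untouched;
    older tokens are reordered by `perm`, quantized per group, and restored."""
    tokens, channels = cache.shape
    assert channels % group_size == 0
    out = cache.astype(np.float64).copy()
    n_old = max(tokens - window, 0)
    if n_old == 0:
        return out
    reordered = out[:n_old][:, perm]                       # group similar channels
    groups = reordered.reshape(n_old, channels // group_size, group_size)
    deq = fake_quant(groups, bits, clip).reshape(n_old, channels)
    out[:n_old] = deq[:, np.argsort(perm)]                 # undo the reordering
    return out

# Hypothetical usage: channels with very different scales.
rng = np.random.default_rng(0)
cache = rng.normal(size=(16, 32)) * rng.uniform(0.1, 10.0, size=32)
perm = np.argsort(np.abs(cache).max(axis=0))   # assumed reordering heuristic
approx = sliding_window_quant(cache, perm)
```

Because quantization parameters are computed per token and per group at run time, the scheme is dynamic; placing channels with similar ranges in the same group is what keeps a 2-bit grid from being dominated by a few large-magnitude channels.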

Taxonomy

Topics: Topic Modeling