SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen

TL;DR
SnapKV compresses the Key-Value (KV) cache of large language models, reducing memory use and speeding up generation on long inputs while keeping performance comparable to the uncompressed baseline.
Contribution
It introduces a fine-tuning-free approach that automatically compresses the KV cache by selecting clustered important KV positions for each attention head, enabling efficient processing of very long sequences.
Findings
Generates 3.6x faster than the baseline on 16K-token inputs
Improves memory efficiency by 8.2x over the baseline on 16K-token inputs
Processes up to 380K context tokens on a single A100-80GB GPU
Abstract
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an 'observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our…
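The selection step described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `select_kv_positions`, its parameters (`window`, `budget`, `kernel`), and the use of 1D max pooling to approximate "clustered" positions are all assumptions made for this example. It takes the attention weights of the last `window` prompt tokens (the observation window) and returns, for each head, the prompt positions whose KV entries would be kept.

```python
from typing import List


def select_kv_positions(
    attn: List[List[List[float]]],
    window: int,
    budget: int,
    kernel: int = 3,
) -> List[List[int]]:
    """Hypothetical helper: per-head prompt positions whose KV entries are kept.

    attn[h][q][k] is the attention weight that the q-th query of the
    observation window (the last `window` prompt tokens) in head h
    assigns to prompt position k.
    """
    kept_per_head = []
    for head in attn:
        prompt_len = len(head[0])
        prefix_len = prompt_len - window
        # 1) Vote: total attention each prefix position receives from the window.
        votes = [sum(row[k] for row in head) for k in range(prefix_len)]
        # 2) Cluster (assumed: max pooling) so neighbours of a peak score highly too.
        half = kernel // 2
        pooled = [
            max(votes[max(0, k - half): min(prefix_len, k + half + 1)])
            for k in range(prefix_len)
        ]
        # 3) Select the `budget` best prefix positions; ties go to the earlier one.
        top = sorted(range(prefix_len), key=lambda k: (-pooled[k], k))[:budget]
        # 4) Always keep the observation window itself.
        kept = sorted(top) + list(range(prefix_len, prompt_len))
        kept_per_head.append(kept)
    return kept_per_head


# Toy input: one head, an 8-token prompt, and a 2-token observation window.
row0 = [0.0, 0.1, 0.6, 0.1, 0.0, 0.0, 0.1, 0.1]
row1 = [0.0, 0.0, 0.7, 0.0, 0.0, 0.1, 0.1, 0.1]
kept = select_kv_positions([[row0, row1]], window=2, budget=3)
print(kept)
```

In the toy input both window queries attend mostly to position 2, so pooling also pulls in its neighbours 1 and 3, and the two window positions are retained unconditionally. Selecting per head, rather than once for the whole layer, follows the abstract's observation that each head focuses on its own set of prompt features.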
