SnapKV: LLM Knows What You are Looking for Before Generation

Yuhong Li; Yingbing Huang; Bowen Yang; Bharat Venkitesh; Acyr Locatelli; Hanchen Ye; Tianle Cai; Patrick Lewis; Deming Chen

arXiv:2404.14469 · cs.CL · June 18, 2024 · 5 cites

TL;DR

SnapKV compresses Key-Value (KV) caches in large language models, improving memory and speed efficiency on long inputs while delivering performance comparable to the uncompressed model.

Contribution

SnapKV is a fine-tuning-free approach that automatically compresses the KV cache by selecting clustered important KV positions for each attention head, which enables efficient processing of very long sequences.

Findings

01 Achieves 3.6x faster decoding than the baseline on 16K-token inputs

02 Improves memory efficiency by 8.2x over the baseline on 16K-token inputs

03 Can process up to 380K context tokens on a single A100-80GB GPU
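To put the token counts above in perspective, here is a back-of-the-envelope sketch of how an uncompressed KV cache grows with input length. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values) is a hypothetical 7B-class configuration chosen for illustration, not a figure taken from this page, and `kv_cache_bytes` is an illustrative helper name.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128, bytes_per_value=2):
    """Size of an uncompressed KV cache for one sequence.

    All default dimensions are assumptions for illustration only.
    """
    # Factor 2: one key vector and one value vector per token, per head, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
for n in (16 * 1024, 380 * 1000):
    print(n, "tokens:", round(kv_cache_bytes(n) / GIB, 1), "GiB")
```

Under these assumed dimensions the cache costs 0.5 MiB per token, so it grows linearly with the prompt and would far exceed a single GPU's memory at several hundred thousand tokens unless it is compressed.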

Abstract

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an 'observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our…
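The selection step described in the abstract can be sketched roughly as follows for a single attention head. This is a minimal illustration, not the authors' implementation: the function name, the argument layout, and the use of 1D max pooling to keep neighbours of strongly attended positions together are all assumptions made for the example.

```python
def select_kv_positions(obs_attn, prefix_len, budget, kernel=5):
    """Pick which prefix positions to keep for ONE attention head (sketch).

    obs_attn:   one row per observation-window query; each row holds that
                query's attention weights over the `prefix_len` earlier
                prompt positions.
    budget:     number of prefix positions to retain.
    kernel:     odd pooling width (assumed); pooling lets the neighbours
                of a strongly attended position be kept along with it.
    """
    # 1. Vote: total attention each prefix position receives from the window.
    votes = [sum(row[j] for row in obs_attn) for j in range(prefix_len)]

    # 2. Cluster: 1D max pooling with stride 1 and same-length output.
    half = kernel // 2
    pooled = [max(votes[max(0, j - half): j + half + 1])
              for j in range(prefix_len)]

    # 3. Select: the `budget` highest-scoring positions, in original order.
    ranked = sorted(range(prefix_len), key=lambda j: pooled[j], reverse=True)
    return sorted(ranked[:budget])


# Two observation-window queries attending over an 8-token prefix.
obs_attn = [
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.01, 0.02, 0.03, 0.4, 0.04, 0.05, 0.06, 0.07],
]
print(select_kv_positions(obs_attn, prefix_len=8, budget=3, kernel=3))  # [2, 3, 4]
```

In this toy input position 3 dominates the votes, and pooling causes its neighbours 2 and 4 to be retained with it. A full pipeline would run this per head and presumably keep the observation window's own KV entries alongside the selected prefix entries.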

Code & Models

Repositories

fasterdecoding/snapkv (PyTorch, official)

Videos

SnapKV: LLM Knows What You are Looking for Before Generation · SlidesLive
