Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Muhammad Adnan; Akhil Arunkumar; Gaurav Jain; Prashant J. Nair; Ilya Soloveychik; Purushotham Kamath

arXiv:2403.09054 · cs.LG · April 9, 2024 · 6 cites

TL;DR

Keyformer is a novel method that reduces memory bandwidth and KV cache size during generative inference by selecting key tokens, significantly accelerating inference without sacrificing accuracy.

Contribution

It introduces a new token selection technique that effectively reduces KV cache size and bandwidth usage in LLM inference, enhancing efficiency for long-context tasks.
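As a rough illustration of what token selection over a KV cache can look like, the sketch below keeps a fixed budget of cached tokens based on a per-token importance score. It is a generic example, not Keyformer's method: this summary does not describe the paper's scoring function, so the `scores` input, the `budget` and `recent` parameters, and the function name `select_tokens` are all assumptions made for illustration.

```python
def select_tokens(scores, budget, recent):
    """Return sorted indices of cached tokens to keep.

    scores: hypothetical per-token importance values, one per cached token
    budget: total number of tokens to keep in the cache
    recent: number of most recent tokens that are always kept (an assumption)
    """
    n = len(scores)
    if budget >= n:
        # Cache already fits within the budget; keep everything.
        return list(range(n))
    recent = min(recent, budget)
    # Always keep the last `recent` positions.
    recent_idx = list(range(n - recent, n))
    # Among older positions, keep the highest-scoring ones up to the budget.
    older = range(n - recent)
    top = sorted(older, key=lambda i: scores[i], reverse=True)[: budget - recent]
    return sorted(top + recent_idx)


scores = [0.9, 0.1, 0.5, 0.2, 0.3, 0.05]
kept = select_tokens(scores, budget=4, recent=2)
print(kept)  # [0, 2, 4, 5]
```

Keys and values at positions outside the returned list would be dropped from the cache, which is where the memory and bandwidth savings come from in schemes of this kind.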

Findings

01 KV cache size reduced by up to 90%

02 Inference latency decreased by 2.1x

03 Token throughput increased by 2.4x

Abstract

Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization.…
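To make the memory pressure concrete, the following back-of-the-envelope calculation estimates KV cache size for a standard multi-head attention decoder. The model shape used here (32 layers, 32 heads, head dimension 128, 16-bit values) is a hypothetical example and is not taken from the paper; the helper name `kv_cache_bytes` is likewise illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor of 2 covers one key vector and one value vector
    # per token, per head, per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, fp16 values, single sequence.
full = kv_cache_bytes(32, 32, 128, seq_len=4096)
half = kv_cache_bytes(32, 32, 128, seq_len=2048)

print(full / 2**30)  # 2.0 (GiB) for 4096 cached tokens
print(half / 2**30)  # 1.0 (GiB) if only half the tokens are retained
```

Because the footprint grows linearly with the number of cached tokens, retaining fewer tokens shrinks both the stored cache and the data that must be read from memory at each generation step.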

Code & Models

Repositories

d-matrix-ai/keyformer-llm (PyTorch, official)

Taxonomy

Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · Advanced Data Storage Technologies