Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Haoyi Wu; Kewei Tu

arXiv:2405.10637·cs.CL·June 5, 2024


TL;DR

This paper introduces a layer-condensed KV cache: the model computes and caches the KVs of only a small number of layers, which reduces memory usage and yields up to 26x higher inference throughput than standard transformers.

Contribution

It presents a novel approach to significantly decrease memory consumption and increase inference speed by caching KVs for fewer layers in transformer models.

Findings

01 Achieves up to 26x higher throughput than standard transformers.

02 Maintains competitive performance on language modeling and downstream tasks.

03 Compatible with existing memory-saving techniques for further efficiency.

Abstract

Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep language models. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput. Our experiments on large language models show that our method achieves up to 26$\times$ higher throughput than standard transformers and competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate…
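To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with the number of layers whose KVs are cached. The model dimensions (32 layers, 32 KV heads, head dimension 128, fp16) and the choice of caching a single layer are illustrative assumptions, not figures taken from the paper, and the helper name `kv_cache_bytes` is hypothetical. The sketch covers cache memory only and says nothing about the reported throughput gains.

```python
def kv_cache_bytes(cached_layers, batch_size, seq_len,
                   num_kv_heads, head_dim, bytes_per_element=2):
    """Estimate KV cache size in bytes.

    Each cached layer stores one key tensor and one value tensor of shape
    (batch_size, num_kv_heads, seq_len, head_dim), hence the factor of 2.
    bytes_per_element=2 assumes fp16/bf16 storage.
    """
    per_layer = 2 * batch_size * seq_len * num_kv_heads * head_dim * bytes_per_element
    return cached_layers * per_layer


# Hypothetical model configuration (assumed for illustration only).
config = dict(batch_size=1, seq_len=2048, num_kv_heads=32, head_dim=128)

full = kv_cache_bytes(cached_layers=32, **config)      # standard: cache every layer
condensed = kv_cache_bytes(cached_layers=1, **config)  # assumed: cache one layer

print(f"{full / 2**30:.2f} GiB")       # 1.00 GiB
print(f"{condensed / 2**20:.2f} MiB")  # 32.00 MiB
print(full // condensed)               # 32
```

Under these assumptions the cache shrinks in direct proportion to the number of cached layers, so the freed memory can in principle be spent on larger batches or longer sequences.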


Code & Models

Repositories

whyNLP/LCKV (PyTorch, official)

Videos

Layer-Condensed KV Cache for Efficient Inference of Large Language Models (underline)

Taxonomy

Topics: Topic Modeling