Layer-Condensed KV Cache for Efficient Inference of Large Language Models
Haoyi Wu, Kewei Tu

TL;DR
This paper introduces a layer-condensed KV cache that computes and caches the KVs of only a small number of layers, reducing memory usage and achieving up to 26x higher inference throughput than standard transformers.
Contribution
The paper proposes computing and caching the KVs of only a small number of transformer layers, which significantly reduces memory consumption and improves inference throughput.
Findings
Achieves up to 26x higher throughput than standard transformers.
Maintains competitive performance on language modeling and downstream tasks.
Compatible with existing memory-saving techniques for further efficiency.
Abstract
Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep language models. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput. Our experiments on large language models show that our method achieves up to 26× higher throughput than standard transformers and competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate…
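
To make the memory argument concrete, here is a back-of-the-envelope sketch of how KV cache size scales with the number of cached layers. The model shape, batch size, and the choice of 2 cached layers are hypothetical values picked for illustration, not figures from the paper, and the function name kv_cache_bytes is our own. The sketch only estimates cache memory; it does not reproduce the paper's throughput measurements.

```python
def kv_cache_bytes(num_cached_layers, batch_size, seq_len,
                   num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the bytes needed to hold keys and values for the cached layers.

    bytes_per_value=2 assumes 16-bit storage (e.g. fp16 or bf16).
    """
    # Factor of 2: one tensor for keys and one for values per cached layer.
    return (2 * num_cached_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_value)


# Hypothetical model shape and workload (assumptions, not from the paper).
config = dict(batch_size=32, seq_len=2048, num_kv_heads=32, head_dim=128)

full = kv_cache_bytes(num_cached_layers=32, **config)      # cache every layer
condensed = kv_cache_bytes(num_cached_layers=2, **config)  # cache only 2 layers

print(f"all 32 layers cached: {full / 2**30:.1f} GiB")       # 32.0 GiB
print(f"only 2 layers cached: {condensed / 2**30:.1f} GiB")  # 2.0 GiB
print(f"cache size reduction: {full // condensed}x")         # 16x
```

Because the cache grows linearly in the number of cached layers, caching fewer layers frees memory that can instead hold larger batches or longer sequences, which is one plausible route to the higher throughput the abstract reports.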
