A Method for Building Large Language Models with Predefined KV Cache Capacity
Zhonghua Yi, Ge Niu, Lei Wang, Wei Tang, Liqiu Zhang

TL;DR
This paper presents the Bounded-Cache Transformer (BCT), a method for building large language models whose KV cache has a predefined, bounded capacity, reducing memory usage without sacrificing performance.
Contribution
The BCT introduces a bounded-length KV cache mechanism for Transformer models, enabling efficient inference with limited memory resources.
Findings
Reduces memory consumption during inference
Maintains inference quality with limited cache capacity
Demonstrates significant efficiency improvements in experiments
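To make the memory claim concrete, the following back-of-the-envelope sketch compares the size of a KV cache that grows with sequence length against one capped at a fixed number of positions. The formula is the standard accounting for a Transformer KV cache (keys plus values, per layer, per head). The model shape and the capacities are hypothetical values chosen for illustration and are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, cached_tokens,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to hold keys and values for `cached_tokens` positions."""
    # Factor 2: each layer stores one tensor of keys and one of values.
    return (2 * num_layers * num_kv_heads * head_dim
            * cached_tokens * bytes_per_value * batch_size)


# Hypothetical model shape (illustration only, not from the paper).
LAYERS, HEADS, HEAD_DIM = 32, 32, 128

# Unbounded cache: grows to the full sequence length (here 32,768 tokens).
unbounded = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, cached_tokens=32768)

# Bounded cache: capped at an assumed capacity of 2,048 positions,
# regardless of how long the sequence becomes.
bounded = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, cached_tokens=2048)

print(f"unbounded: {unbounded / 2**30:.1f} GiB")
print(f"bounded:   {bounded / 2**30:.1f} GiB")
```

Under these assumptions the bounded cache is smaller by the ratio of sequence length to capacity, and its size stays constant as generation continues.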
Abstract
This paper introduces the Bounded-Cache Transformer (BCT), an approach for building large language models with a predefined Key-Value (KV) cache capacity. The BCT addresses the excessive memory consumption of traditional KV caches by implementing a bounded-length KV cache, which is particularly suited to the attention layers of decoder-only Transformer architectures. By dynamically updating the key-value vector sequences, the BCT achieves efficient inference within a limited cache capacity while maintaining model performance and system throughput. Experimental results demonstrate that the BCT significantly reduces memory usage while maintaining the model's inference quality, offering a new solution for efficient inference in large language models.
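The abstract does not say how the key-value sequences are updated, so the sketch below is only an illustration of what a fixed-capacity cache can look like, not the BCT update rule. It assumes a single attention head and a simple overwrite-the-oldest (ring buffer) policy; the class name `BoundedKVCache` and its methods are hypothetical.

```python
import numpy as np


class BoundedKVCache:
    """Fixed-capacity KV cache for one attention head.

    Assumption: the oldest entry is overwritten once the cache is full.
    This stands in for whatever update rule the paper actually uses.
    """

    def __init__(self, capacity, head_dim):
        self.capacity = capacity
        # Storage is allocated once and never grows.
        self.keys = np.zeros((capacity, head_dim), dtype=np.float32)
        self.values = np.zeros((capacity, head_dim), dtype=np.float32)
        self.length = 0      # number of valid slots
        self.next_slot = 0   # ring-buffer write position

    def append(self, k, v):
        """Store one key/value pair, overwriting the oldest when full."""
        self.keys[self.next_slot] = k
        self.values[self.next_slot] = v
        self.next_slot = (self.next_slot + 1) % self.capacity
        self.length = min(self.length + 1, self.capacity)

    def attend(self, q):
        """Scaled dot-product attention of query `q` over cached entries."""
        # Slot order does not matter here: softmax attention is invariant
        # to permuting key/value pairs together.
        k = self.keys[:self.length]
        v = self.values[:self.length]
        scores = k @ q / np.sqrt(q.shape[-1])
        scores -= scores.max()           # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum()
        return weights @ v


# Decode 10 steps with a cache that holds at most 4 positions.
rng = np.random.default_rng(0)
cache = BoundedKVCache(capacity=4, head_dim=8)
for _ in range(10):
    k, v, q = rng.standard_normal((3, 8)).astype(np.float32)
    cache.append(k, v)
    out = cache.attend(q)
```

After ten decoding steps the cache still occupies the same two `(4, 8)` arrays it was created with, which is the property a predefined cache capacity is meant to guarantee. How well quality is preserved depends entirely on the update rule, which this sketch does not attempt to reproduce.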
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Topic Modeling
Methods: Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Softmax · Attention Is All You Need
