A Method for Building Large Language Models with Predefined KV Cache Capacity

Zhonghua Yi; Ge Niu; Lei Wang; Wei Tang; Liqiu Zhang

arXiv:2411.15785·cs.CL·November 28, 2024

TL;DR

This paper presents the Bounded-Cache Transformer (BCT), a method for building large language models whose KV cache has a predefined, bounded capacity, reducing inference memory usage without sacrificing model performance.

Contribution

The BCT introduces a bounded-length KV cache mechanism for Transformer models, enabling efficient inference with limited memory resources.

Findings

01 Reduces memory consumption during inference

02 Maintains inference quality with limited cache capacity

03 Demonstrates significant efficiency improvements in experiments
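
To put the memory finding in concrete terms, the following back-of-the-envelope sketch compares a KV cache that grows with the sequence against one capped at a fixed capacity. The model shape, sequence length, and capacity are hypothetical values chosen for illustration; they are not taken from the paper, and the formula is the standard per-token KV footprint rather than anything specific to the BCT.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, cached_tokens,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to store keys and values for `cached_tokens` positions."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * cached_tokens * bytes_per_value * batch_size)


# Hypothetical model shape (assumption, not from the paper); 2 bytes = fp16.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

unbounded = kv_cache_bytes(cached_tokens=32_768, **shape)  # grows with the sequence
bounded = kv_cache_bytes(cached_tokens=1_024, **shape)     # capped at a fixed capacity

print(f"unbounded: {unbounded / 2**30:.2f} GiB")  # unbounded: 16.00 GiB
print(f"bounded:   {bounded / 2**30:.2f} GiB")    # bounded:   0.50 GiB
```

Under these assumed numbers the cache footprint scales linearly with the number of cached positions, so capping the capacity caps the memory regardless of how long generation runs.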

Abstract

This paper introduces a novel approach, the Bounded-Cache Transformer (BCT), for building large language models with a predefined Key-Value (KV) cache capacity. The BCT addresses the excessive memory consumption of traditional KV caches by implementing a bounded-length KV cache, which is particularly suitable for the attention layers in decoder-only Transformer architectures. By dynamically updating the key-value vector sequences, the BCT achieves efficient inference within a limited cache capacity while maintaining system throughput. Experimental results demonstrate that the BCT significantly reduces memory usage while maintaining the model's inference quality, offering a new solution for efficient inference in large language models.
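
The abstract does not spell out how the key-value sequences are updated, so the following is only a minimal sketch of the general idea of a bounded-length cache, not the authors' method. It assumes the simplest possible update rule, sliding-window eviction (drop the oldest entry once the cache is full), and the class name `BoundedKVCache` and its methods are hypothetical.

```python
import math
from collections import deque


class BoundedKVCache:
    """Fixed-capacity KV cache for a single attention head (illustrative only).

    Assumption: oldest-first eviction stands in for the paper's update rule,
    which is not described here.
    """

    def __init__(self, capacity):
        # deque(maxlen=...) discards the oldest entry when a new one is
        # appended to a full buffer, so memory never exceeds `capacity`.
        self.keys = deque(maxlen=capacity)
        self.values = deque(maxlen=capacity)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query):
        """Scaled dot-product attention of `query` over the cached entries."""
        if not self.keys:
            raise ValueError("cache is empty")
        scale = 1.0 / math.sqrt(len(query))
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in self.keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [sum(w * v[i] for w, v in zip(weights, self.values)) / total
                for i in range(dim)]


cache = BoundedKVCache(capacity=4)
for step in range(10):
    vec = [float(step), 1.0]  # toy key/value vectors
    cache.append(key=vec, value=vec)

output = cache.attend(query=[1.0, 0.0])
print(len(cache.keys))  # 4: only the most recent four positions remain
```

After ten decoding steps the cache still holds four entries, so its memory cost is fixed in advance. A learned or content-dependent update rule would replace the `append` logic while keeping the same capacity bound.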

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Topic Modeling

Methods: Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Softmax · Attention Is All You Need