CORM: Cache Optimization with Recent Message for Large Language Model Inference

Jincheng Dai; Zhuowei Huang; Haiyun Jiang; Chen Chen; Deng Cai; Wei Bi; Shuming Shi

arXiv:2404.15949 · cs.CL · June 24, 2024

Open Access

TL;DR

This paper introduces CORM, a KV cache optimization method for large language model inference that reduces KV cache memory usage by up to 70% without notable performance loss, making deployment more memory-efficient.

Contribution

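CORM is a KV cache eviction policy that exploits the similarity between adjacent tokens' query vectors and the dependence of the current query's attention on that of a few preceding queries, reducing the memory footprint without requiring model fine-tuning.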

Findings

01. KV cache memory reduced by up to 70%

02. Negligible performance degradation across six tasks

03. Compatible with GQA for additional compression
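To put the "up to 70%" figure in context, the size of a full KV cache can be estimated from the model shape. The sketch below is a back-of-the-envelope calculation, not something taken from the paper: the default dimensions (32 layers, 32 KV heads, head dimension 128, 2-byte elements) are illustrative assumptions for a 7B-class model, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to store keys and values for every layer.

    The leading factor 2 accounts for storing both K and V.
    All default dimensions are illustrative assumptions.
    """
    return (2 * batch_size * n_layers * n_kv_heads
            * head_dim * seq_len * bytes_per_elem)


GIB = 1024 ** 3
full = kv_cache_bytes(4096)

# Cache size grows linearly with sequence length.
print(f"full cache at 4096 tokens: {full / GIB:.2f} GiB")
# What an eviction policy that removes 70% of entries would leave.
print(f"after a 70% reduction:     {0.30 * full / GIB:.2f} GiB")
# GQA shrinks the cache by sharing KV heads (here 8 instead of 32).
print(f"GQA with 8 KV heads:       {kv_cache_bytes(4096, n_kv_heads=8) / GIB:.2f} GiB")
```

Under these assumed dimensions the three lines print 2.00, 0.60, and 0.50 GiB. Eviction and GQA act on different axes (cached positions versus KV heads), which is one plausible reading of why the two can be combined.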

Abstract

Large Language Models (LLMs), despite their remarkable performance across a wide range of tasks, necessitate substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory used by the KV cache rises linearly with sequence length, becoming a primary bottleneck for inference. In this paper, we introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. Upon thorough investigation, we discover that in most Transformer models, (i) there is a striking similarity between adjacent tokens' query vectors, and (ii) the attention calculation of the current query can rely exclusively on the attention information of a small fraction of preceding queries. Based on these observations, we present CORM, a KV cache eviction policy that dynamically retains essential key-value pairs for…
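Observation (ii) suggests one way an eviction rule driven by recent queries could look. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `select_kept_positions`, the window of recent queries, and the "above uniform attention" threshold are all choices made here for illustration, since the abstract as shown does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; rows may contain -inf for masked slots."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_kept_positions(recent_queries, keys):
    """Return a boolean mask over cached positions that should be kept.

    recent_queries: (w, d) query vectors of the last w tokens (w <= t).
    keys:           (t, d) cached key vectors for all t tokens so far.

    Assumed rule: a position is kept if at least one recent query gives
    it attention at or above the uniform level 1 / (number of visible
    keys). The last w positions are always kept.
    """
    w, d = recent_queries.shape
    t = keys.shape[0]

    scores = recent_queries @ keys.T / np.sqrt(d)      # (w, t)

    # Causal mask: the i-th recent query sits at position t - w + i
    # and may only attend to keys at or before its own position.
    q_pos = np.arange(t - w, t)
    visible = np.arange(t)[None, :] <= q_pos[:, None]  # (w, t)
    scores = np.where(visible, scores, -np.inf)

    probs = softmax(scores, axis=-1)
    threshold = 1.0 / (q_pos + 1)                      # uniform attention level
    important = probs >= threshold[:, None]

    keep = important.any(axis=0)
    keep[t - w:] = True                                # recent tokens stay
    return keep


# Toy example: only key 0 is strongly aligned with the recent queries.
queries = np.ones((2, 4))
keys = np.zeros((6, 4))
keys[0] = 10.0
mask = select_kept_positions(queries, keys)
print(mask)  # positions 0, 4 and 5 are kept; 1-3 could be evicted
```

In a real decoder this selection would run per layer and per head, and the evicted positions would be dropped from both the key and value tensors. The sketch only shows the selection step on a single head.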

Taxonomy

Topics: Genetics, Bioinformatics, and Biomedical Research

Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings