Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

Harry Dong; Xinyu Yang; Zhenyu Zhang; Zhangyang Wang; Yuejie Chi; Beidi Chen

arXiv:2402.09398 · cs.LG · June 13, 2024 · 1 citation

TL;DR

This paper introduces LESS, which pairs an eviction-based KV cache with a small, constant-sized cache so that information from all previous tokens stays queryable during large language model inference, reducing memory use while narrowing the performance gap to a full cache.

Contribution

LESS is a novel approach that integrates a nearly free, constant-sized cache with eviction-based methods to improve memory efficiency in LLM inference.

Findings

01 LESS narrows the performance gap to a full KV cache on a variety of tasks.

02 It keeps the memory savings of eviction-based caching, adding only a small, constant-sized cache.

03 The approach maintains high performance with minimal additional computational cost.
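To put the memory finding in concrete terms, the sketch below estimates how large a standard KV cache gets and how much an eviction budget shrinks it. The formula is the usual count of key and value elements per layer; the model dimensions, the 512-token budget, and the function name `kv_cache_bytes` are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to store keys and values for every cached token."""
    # Factor of 2: each layer stores one tensor of keys and one of values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model: 32 layers, 32 heads of dimension 128, fp16 storage.
full = kv_cache_bytes(32, 32, 128, seq_len=4096)
# Same model when an eviction policy keeps only 512 tokens per head.
kept = kv_cache_bytes(32, 32, 128, seq_len=512)

print(full / 2**30, kept / 2**30)  # sizes in GiB: 2.0 0.25
```

The full cache grows linearly with sequence length, while a fixed budget caps it. Any constant-sized state added on top of the budget does not change that scaling.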

Abstract

Many computational factors limit broader deployment of large language models. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout time shows merit on a variety of tasks where we demonstrate LESS can help reduce the performance gap from caching everything,…
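The idea in the abstract, an eviction-based cache paired with a constant-sized cache so that evicted tokens can still be queried, could be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class `ToyHybridCache`, the oldest-first eviction rule, and the fixed feature map `phi` are all hypothetical stand-ins, and the usual attention scaling is omitted for brevity.

```python
import math


def phi(x):
    # Placeholder positive feature map (an assumption of this sketch).
    return [math.exp(xi) for xi in x]


class ToyHybridCache:
    """Exact KV pairs up to a budget, plus a fixed-size summary of evictions."""

    def __init__(self, budget, d_k, d_v):
        self.budget = budget
        self.keys, self.values = [], []
        # Constant-sized state: S accumulates phi(k) v^T, z accumulates phi(k).
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.budget:
            # Stand-in eviction policy: drop the oldest pair, but fold it
            # into the constant-sized state instead of discarding it.
            old_k, old_v = self.keys.pop(0), self.values.pop(0)
            for i, fi in enumerate(phi(old_k)):
                self.z[i] += fi
                for j, vj in enumerate(old_v):
                    self.S[i][j] += fi * vj

    def attend(self, q):
        d_v = len(self.S[0])
        num, den = [0.0] * d_v, 0.0
        # Exact softmax-style terms for the pairs still in the cache.
        for k, v in zip(self.keys, self.values):
            w = math.exp(sum(a * b for a, b in zip(q, k)))
            den += w
            for j in range(d_v):
                num[j] += w * v[j]
        # Approximate terms for evicted pairs, read from the fixed-size state.
        for i, fi in enumerate(phi(q)):
            den += fi * self.z[i]
            for j in range(d_v):
                num[j] += fi * self.S[i][j]
        return [n / den for n in num]


cache = ToyHybridCache(budget=2, d_k=2, d_v=1)
for t in range(5):
    cache.append([0.1 * t, -0.1 * t], [float(t)])
out = cache.attend([0.5, 0.5])
print(len(cache.keys), out)
```

However many tokens are appended, the sketch stores at most `budget` exact pairs plus `d_k * d_v + d_k` numbers of state, yet every evicted value still contributes to the output.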

Code & Models

Repositories

hdong920/less (Official)

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Storage Technologies

Methods: Pruning · Focus