EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving

Shaoting Feng; Yuhan Liu; Hanchen Li; Xiaokun Chen; Samuel Shen; Kuntai Du; Zhuohan Gu; Rui Zhang; Yuyang Huang; Yihua Cheng; Jiayi Yao; Qizheng Zhang; Ganesh Ananthanarayanan; Junchen Jiang

arXiv:2512.14946 · cs.OS · December 18, 2025

TL;DR

EVICPRESS is a system that jointly optimizes KV-cache compression and eviction across multiple storage tiers to reduce latency and maintain quality in large language model inference.

Contribution

The paper introduces a unified utility-based approach for jointly managing cache eviction and lossy compression across storage tiers in LLM serving.

Findings

01 Up to 2.19x faster time to first token (TTFT) at the same quality

02 Higher cache hit rates on fast devices

03 Effective preservation of generation quality

Abstract

Reusing KV cache is essential for efficient Large Language Model (LLM) inference. As the number of LLM users grows, the KV cache footprint can easily exceed GPU memory capacity, so prior work has proposed either evicting KV cache to lower-tier storage devices or compressing it so that more KV cache fits in fast memory. However, prior work misses an important opportunity: jointly optimizing the eviction and compression decisions across all KV caches to minimize average generation latency without hurting quality. We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers. Specifically, for each KV cache of a context, EVICPRESS considers the effect of compressing and evicting that KV cache on the average generation quality and delay across all contexts as a whole. To achieve this,…
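The abstract is cut off before it describes the mechanism, so the following is only a toy illustration of what a joint compression-and-placement decision could look like, not the paper's algorithm. Everything in it is an assumption made for the example: the tier names, bandwidths, capacities, quality scores, the `utility` formula, and the brute-force search over all assignments.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Option:
    tier: str          # hypothetical tier name
    keep_ratio: float  # fraction of KV bytes kept after lossy compression
    quality: float     # assumed generation quality in [0, 1]


# Illustrative numbers only: load bandwidth (GB/s) and capacity (GB) per tier.
BANDWIDTH = {"gpu": 100.0, "ssd": 2.0}
CAPACITY = {"gpu": 4.0, "ssd": 100.0}

OPTIONS = [
    Option("gpu", 1.0, 1.00),
    Option("gpu", 0.5, 0.95),
    Option("ssd", 1.0, 1.00),
    Option("ssd", 0.5, 0.95),
]


def utility(size_gb, freq, opt, alpha=1.0):
    """Assumed utility: reuse frequency times (weighted quality minus load delay)."""
    delay = size_gb * opt.keep_ratio / BANDWIDTH[opt.tier]
    return freq * (alpha * opt.quality - delay)


def plan(contexts, alpha=1.0):
    """contexts: list of (size_gb, reuse_freq). Returns one Option per context.

    Exhaustively tries every (tier, compression) assignment and keeps the
    feasible one with the highest total utility. Only practical for tiny inputs.
    """
    best, best_score = None, float("-inf")
    for combo in product(OPTIONS, repeat=len(contexts)):
        used = {tier: 0.0 for tier in CAPACITY}
        for (size, _), opt in zip(contexts, combo):
            used[opt.tier] += size * opt.keep_ratio
        if any(used[tier] > CAPACITY[tier] for tier in CAPACITY):
            continue  # this assignment overflows some tier
        score = sum(utility(s, f, o, alpha) for (s, f), o in zip(contexts, combo))
        if score > best_score:
            best, best_score = list(combo), score
    return best


if __name__ == "__main__":
    for opt in plan([(3.0, 10.0), (3.0, 1.0)]):
        print(opt)
```

With these made-up numbers, two 3 GB contexts cannot both sit uncompressed in the 4 GB fast tier. The search ends up compressing both to half size and keeping both on the fast tier, rather than evicting the less frequently reused one. This is the kind of trade-off that treating eviction and compression as separate decisions can miss.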

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Cloud Computing and Resource Management