SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference

Yi Zhao; Yajuan Peng; Cam-Tu Nguyen; Zuchao Li; Xiaoliang Wang; Hai Zhao; Xiaoming Fu

arXiv:2508.02751·cs.LG·January 21, 2026

TL;DR

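SmallKV uses a small model to compensate for the information lost when an LLM's KV cache is compressed. It targets two weaknesses of token-level eviction (saliency shift and over-compression of marginally important tokens) to make long-context inference more efficient in resource-limited settings.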

Contribution

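The paper proposes SmallKV, a small-model-assisted compensation method that exploits the high similarity of attention matrices between LLMs of different scales to keep attention aligned across the two models and improve KV cache management during LLM inference.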

Findings

01 Achieves 1.75-2.56x higher throughput than baseline methods.

02 Effectively maintains attention matching across different-scale LLMs.

03 Demonstrates improved performance on multiple benchmark datasets.

Abstract

KV cache eviction has emerged as an effective solution to alleviate resource constraints faced by LLMs in long-context scenarios. However, existing token-level eviction methods often overlook two critical aspects: (1) their irreversible eviction strategy fails to adapt to dynamic attention patterns during decoding (the saliency shift problem), and (2) they treat both marginally important tokens and truly unimportant tokens equally, despite the collective significance of marginal tokens to model performance (the marginal information over-compression problem). To address these issues, we design two compensation mechanisms based on the high similarity of attention matrices between LLMs of different scales. We propose SmallKV, a small model assisted compensation method for KV cache compression. SmallKV can maintain attention matching between different-scale LLMs to: 1) assist the larger…
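
The two ideas the abstract relies on, attention similarity across model scales and a distinction between marginally important and truly unimportant tokens, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the budgets, and the toy attention values below are all hypothetical, and using the small model's attention as a proxy score is an assumption drawn from the abstract's description.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two attention rows of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tier_tokens(proxy_scores, keep_budget, marginal_budget):
    """Split cached token positions into three tiers by proxy score.

    Hypothetical helper: the highest-scoring positions are 'critical',
    the next group is 'marginal', and the rest are 'unimportant'.
    Each tier is returned as a sorted list of token positions.
    """
    order = sorted(range(len(proxy_scores)),
                   key=lambda i: proxy_scores[i], reverse=True)
    critical = sorted(order[:keep_budget])
    marginal = sorted(order[keep_budget:keep_budget + marginal_budget])
    unimportant = sorted(order[keep_budget + marginal_budget:])
    return critical, marginal, unimportant


# Toy attention distributions over six cached tokens (made-up numbers).
small_attn = [0.40, 0.05, 0.25, 0.02, 0.18, 0.10]
large_attn = [0.38, 0.06, 0.27, 0.01, 0.20, 0.08]

# How closely the small model's attention tracks the large model's.
similarity = cosine_similarity(small_attn, large_attn)

# Assumption: the small model's attention stands in for the large model's
# when deciding which cached tokens matter at the current decoding step.
critical, marginal, unimportant = tier_tokens(
    small_attn, keep_budget=2, marginal_budget=2)

print(f"similarity={similarity:.4f}")
print("critical:", critical, "marginal:", marginal, "unimportant:", unimportant)
```

Because the tiers are recomputed from fresh proxy scores rather than fixed once, a token that scores low at one step can move back into a higher tier later, which is the kind of adaptivity that an irreversible eviction strategy lacks.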
