TL;DR
ReST-KV is a KV cache eviction method for large language models that improves long-context performance and reduces decoding latency by modeling how token removal affects the model output and by smoothing temporal and spatial variations in KV selection.
Contribution
It formulates KV eviction as an output discrepancy minimization problem using layer-wise reconstruction and incorporates spatial-temporal smoothing for robustness.
Findings
Consistently outperforms state-of-the-art methods on the LongBench and RULER long-context benchmarks.
Achieves a 10.61× reduction in decoding latency at 128k context length.
Abstract
Large language models (LLMs) face growing challenges in efficient generative inference due to the increasing memory demands of Key-Value (KV) caches, especially for long sequences. Existing eviction methods typically retain KV pairs with high attention weights but overlook the impact of attention redistribution caused by token removal, as well as the spatial-temporal dynamics in KV selection. In this paper, we propose ReST-KV, a robust KV eviction method that combines layer-wise output Reconstruction and Spatial-Temporal smoothing to provide a more comprehensive perspective for the KV cache eviction task. Specifically, ReST-KV formulates KV cache eviction as an optimization problem that minimizes output discrepancies through efficient layer-wise reconstruction. By directly modeling how each token's removal affects the model output, our method naturally captures attention redistribution…
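The idea of scoring a token by how its removal changes the attention output, rather than by its attention weight alone, can be illustrated with a small NumPy sketch. This is not the paper's algorithm: the function names (`removal_discrepancy`, `smooth_scores`, `select_kv`), the single-head setup, the exponential decay over recent queries, and the moving-average window over positions are all assumptions made for illustration. The one exact fact it relies on is that, for softmax attention, dropping token i renormalizes the remaining weights, so the output shifts by a_i / (1 - a_i) times the distance between v_i and the original output.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def removal_discrepancy(Q, K, V):
    """Output change caused by removing each cached token, per query.

    Q: (m, d) recent queries, K: (n, d) keys, V: (n, dv) values.
    Returns an (m, n) array whose entry [t, i] is ||o_t - o_t_without_i||.
    """
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # attention weights, (m, n)
    O = A @ V                           # full attention outputs, (m, dv)
    # Distance between each value vector and each query's output, (m, n).
    dist = np.linalg.norm(V[None, :, :] - O[:, None, :], axis=-1)
    # Removing token i rescales the other weights by 1 / (1 - a_i).
    return A / np.clip(1.0 - A, 1e-12, None) * dist

def smooth_scores(S, decay=0.9, window=3):
    """Hypothetical smoothing: decay-weighted mean over queries (temporal),
    then a moving average over neighboring token positions (spatial)."""
    m, _ = S.shape
    w = decay ** np.arange(m - 1, -1, -1)   # most recent query weighted most
    temporal = (w[:, None] * S).sum(axis=0) / w.sum()
    kernel = np.ones(window) / window
    return np.convolve(temporal, kernel, mode="same")

def select_kv(Q, K, V, budget, decay=0.9, window=3):
    """Return sorted indices of the `budget` tokens to keep in the cache."""
    scores = smooth_scores(removal_discrepancy(Q, K, V), decay, window)
    return np.sort(np.argsort(scores)[-budget:])
```

A token with a moderate attention weight but a value vector far from the current output can score higher here than a heavily attended token whose value is close to the output, which is the kind of redistribution effect a weight-only criterion misses.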
