More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang,, Xiaoguang Li, Lifeng Shang, Qun Liu, Sujian Li

TL;DR
This paper explores the trade-off between token quantity and precision in KV cache compression for large language models, proposing a strategy called quantized pruning that improves long-context performance and efficiency.
Contribution
It introduces quantized pruning, a novel approach balancing token count and precision, to optimize KV cache compression in LLMs.
Findings
Quantized pruning significantly improves long-context performance.
It enhances retrieval tasks across various input lengths.
The method is stable across different models and compression strategies.
Abstract
As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately. However, these works leaving the trade-off between these two orthogonal dimensions largely under-explored. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.Experiments demonstrate that storing more tokens in the KV cache with lower precision,a strategy we term quantized pruning, can significantly enhance the long-context performance of LLMs. In-depth analysis of the token-precision trade-off across key aspects demonstrates that, quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Algorithms and Data Compression
MethodsFocus · Pruning
