More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression

Jiebin Zhang; Dawei Zhu; Yifan Song; Wenhao Wu; Chuqiao Kuang; Xiaoguang Li; Lifeng Shang; Qun Liu; Sujian Li

arXiv:2412.12706·cs.CL·February 21, 2025

Open Access

TL;DR

This paper explores the trade-off between token quantity and precision in KV cache compression for large language models, proposing a strategy called quantized pruning that improves long-context performance and efficiency.

Contribution

It introduces quantized pruning, a novel approach balancing token count and precision, to optimize KV cache compression in LLMs.
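The trade-off can be made concrete with simple budget arithmetic. The sketch below is illustrative only: the model shape (`num_layers`, `num_kv_heads`, `head_dim`) is a hypothetical configuration, not one taken from the paper, and the estimate ignores the small overhead of quantization scales and zero points.

```python
def kv_bytes_per_token(bits, num_layers, num_kv_heads, head_dim):
    """Approximate KV cache bytes needed to store one token at a given bit width."""
    # Keys and values each hold num_layers * num_kv_heads * head_dim numbers per token.
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * bits / 8


def tokens_within_budget(budget_bytes, bits, num_layers, num_kv_heads, head_dim):
    """How many tokens fit in a fixed memory budget at the given precision."""
    per_token = kv_bytes_per_token(bits, num_layers, num_kv_heads, head_dim)
    return int(budget_bytes // per_token)


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# Fix the budget to the memory used by 512 tokens stored at 16 bits.
budget = 512 * kv_bytes_per_token(16, **SHAPE)

for bits in (16, 8, 4):
    print(bits, tokens_within_budget(budget, bits, **SHAPE))
```

Under these assumptions the same budget holds 512 tokens at 16 bits, 1024 at 8 bits, and 2048 at 4 bits, which is the sense in which lower precision buys more tokens.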

Findings

01. Quantized pruning significantly improves long-context performance.

02. It improves performance on retrieval-related tasks across various input lengths.

03. The method is stable across different models and compression strategies.

Abstract

As large language models (LLMs) process increasing context windows, the memory usage of the KV cache has become a critical bottleneck during inference. Mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. However, these works leave the trade-off between these two orthogonal dimensions largely under-explored. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression. Experiments demonstrate that storing more tokens in the KV cache with lower precision, a strategy we term quantized pruning, can significantly enhance the long-context performance of LLMs. In-depth analysis of the token-precision trade-off across key aspects demonstrates that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across…
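As a rough illustration of combining the two dimensions, the toy sketch below prunes a cache to its highest-scoring tokens and then stores the survivors at low precision. It is not the paper's implementation: the importance `scores`, the per-token uniform quantizer, and the function names are all assumptions made for this example.

```python
def quantize_row(row, bits):
    """Uniformly quantize one vector to `bits` bits, then dequantize it."""
    levels = (1 << bits) - 1
    lo, hi = min(row), max(row)
    # Guard against a constant row, where the range would be zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in row]  # integer codes in [0, levels]
    return [lo + c * scale for c in codes]


def quantized_prune(cache, scores, keep, bits):
    """Keep the `keep` highest-scoring tokens, in original order, at `bits` bits."""
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [quantize_row(cache[i], bits) for i in kept]


# Toy cache: 4 tokens with 4-dimensional vectors, plus hypothetical importance scores.
cache = [
    [0.0, 1.0, 0.5, 0.25],
    [0.2, -0.4, 0.9, 0.1],
    [1.5, 1.5, 1.5, 1.5],
    [-1.0, 0.0, 1.0, 2.0],
]
scores = [0.1, 0.9, 0.5, 0.7]

kept, compressed = quantized_prune(cache, scores, keep=2, bits=4)
print(kept)
```

Here tokens 1 and 3 survive pruning, and each stored value differs from its original by at most half a quantization step. Raising `keep` while lowering `bits` moves along the token-precision trade-off at roughly constant memory.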

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Algorithms and Data Compression

Methods: Focus · Pruning