TreeKV: Smooth Key-Value Cache Compression with Tree Structures
Ziwei He, Jian Yuan, Haoli Bai, Jingwen Leng, Bo Jiang

TL;DR
TreeKV is a tree-structured, training-free KV cache compression method that preserves language model quality on long sequences while significantly reducing cache size.
Contribution
It presents a novel, smooth cache compression technique using tree structures, enabling effective long-sequence processing without additional training.
Findings
Outperforms baseline models on PG19 and OpenWebText2.
Enables models trained on short contexts to generalize to longer contexts with 16x cache reduction.
Achieves top performance on LongBench with only 6% of the original cache budget.
Abstract
Efficient key-value (KV) cache compression is critical for scaling transformer-based Large Language Models (LLMs) in long sequences and resource-limited settings. Existing methods evict tokens based on their positions or importance scores, but position-based strategies can miss crucial information outside predefined regions, while those relying on global importance scores result in strong regional biases, limiting the KV cache's overall context retention and potentially impairing the performance of LLMs on complex tasks. Our wavelet analysis reveals that as tokens approach the end of sequence, their contributions to generation gradually increase and tend to diverge more from neighboring tokens, indicating a smooth transition with increasing complexity and variability from distant to nearby context. Motivated by this observation, we propose TreeKV, an intuitive, training-free method…
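To make the two families of existing eviction strategies mentioned in the abstract concrete, here is a minimal sketch of each. It is not TreeKV and not any specific published baseline: the function names `evict_by_position` and `evict_by_importance`, the `num_sink` parameter, and the toy scores are all assumptions chosen for illustration. The position-based variant keeps a few leading tokens plus the most recent ones; the importance-based variant keeps the tokens with the highest global scores.

```python
from typing import List, Sequence


def evict_by_position(num_tokens: int, budget: int, num_sink: int = 2) -> List[int]:
    """Hypothetical position-based policy: keep the first `num_sink` tokens
    and the most recent tokens, up to `budget` entries in total."""
    if num_tokens <= budget:
        return list(range(num_tokens))
    num_sink = min(num_sink, budget)
    recent = budget - num_sink
    head = list(range(num_sink))
    tail = list(range(num_tokens - recent, num_tokens)) if recent > 0 else []
    return head + tail


def evict_by_importance(scores: Sequence[float], budget: int) -> List[int]:
    """Hypothetical importance-based policy: keep the `budget` tokens with
    the highest global scores, returned in their original order."""
    if len(scores) <= budget:
        return list(range(len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Toy scores (made up): a cluster of high-scoring tokens near the start.
    toy_scores = [0.1, 0.9, 0.8, 0.85, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3]
    print(evict_by_position(len(toy_scores), budget=4))
    print(evict_by_importance(toy_scores, budget=4))
```

With these toy inputs, the position-based policy drops everything between the leading tokens and the recent window, while the importance-based policy concentrates most of its budget on one high-scoring cluster. These are the two failure modes the abstract attributes to existing methods: missed information outside predefined regions, and regional bias.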
Taxonomy
Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Parallel Computing and Optimization Techniques
