TL;DR
HieraSparse introduces a hierarchical sparse KV Cache compression framework that accelerates long-context LLMs by leveraging GPU sparse tensor cores, achieving significant speedups with minimal quality loss.
Contribution
It presents a novel hierarchical semi-structured sparsity method for KV Cache compression that improves speed and efficiency in LLMs, with a flexible quality-sparsity trade-off.
Findings
Achieves a 1.2x KV compression ratio and a 4.57x attention speedup over the state-of-the-art unstructured-sparsity decode method at the same sparsity level.
Extends semi-structured KV Cache pruning to the prefill stage, with up to 1.85x speedup.
Delivers 1.37x prefill and 1.77x decode speedups without a significant drop in generation quality.
Abstract
The deployment of long-context Large Language Models (LLMs) poses significant challenges due to the intense computational cost of self-attention and the substantial memory overhead of the Key-Value Cache (KV Cache). In this paper, we introduce HieraSparse, a hierarchical KV Cache compression framework with acceleration kernels that leverage GPU sparse tensor cores to speed up semi-structured KV Cache attention for both the prefill and decode phases. With the hierarchical design, our method allows for a flexible quality-sparsity trade-off and successfully converts sparsity into efficiency. Compared to the state-of-the-art decode method that utilizes unstructured sparsity, HieraSparse achieves 1.2x KV compression ratio and 4.57x attention speedup at the same sparsity level. Furthermore, we extended the semi-structured KV Cache pruning to the prefill stage,…
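For readers unfamiliar with the term, semi-structured sparsity can be illustrated with a small NumPy sketch. It assumes the 2:4 pattern (at most two nonzero values in every group of four), which is the pattern NVIDIA sparse tensor cores accelerate, and it assumes a simple magnitude-based selection rule. Neither the pattern nor the selection rule is confirmed by the summary above, and this is not HieraSparse's hierarchical method; the function name `prune_2_of_4` and the toy `keys` array are hypothetical.

```python
import numpy as np


def prune_2_of_4(x: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in each group of four.

    Illustrative assumption: groups are taken along the last axis and
    selection is by absolute value. This is a generic 2:4 pruning rule,
    not the method described in the paper.
    """
    if x.shape[-1] % 4 != 0:
        raise ValueError("last dimension must be a multiple of 4")
    groups = x.reshape(-1, 4)
    # argsort is ascending, so the last two columns index the two
    # largest-magnitude entries of each group.
    keep = np.argsort(np.abs(groups), axis=1)[:, 2:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return np.where(mask, groups, 0.0).reshape(x.shape)


# Toy "key" row with eight channels, i.e. two groups of four.
keys = np.array([[0.1, -2.0, 0.3, 4.0, 1.0, 1.5, -0.2, 0.0]], dtype=np.float32)
pruned = prune_2_of_4(keys)
print(pruned)
```

Because every group of four keeps at most two values, the nonzeros can be stored densely alongside small per-group position metadata, which is what lets hardware skip the zeros in a regular, predictable way. Unstructured sparsity at the same sparsity level has no such regular layout.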
