HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

Haoxuan Wang; Chen Wang

arXiv:2604.16864·cs.DC·April 21, 2026

TL;DR

HieraSparse introduces a hierarchical sparse KV Cache compression framework that accelerates long-context LLMs by leveraging GPU sparse tensor cores, achieving significant speedups with minimal quality loss.

Contribution

Presents a hierarchical semi-structured sparsity method for KV Cache compression that speeds up attention in long-context LLMs and offers a flexible quality-sparsity trade-off.

Findings

01

Achieves a 1.2x KV compression ratio and a 4.57x attention speedup over the state-of-the-art unstructured-sparsity decode method at the same sparsity level.

02

Extends semi-structured KV Cache pruning to the prefill stage, with up to 1.85x speedup.

03

Delivers 1.37x prefill and 1.77x decode speedups without a significant drop in generation quality.

Abstract

The deployment of long-context Large Language Models (LLMs) poses significant challenges due to the intense computational cost of self-attention and the substantial memory overhead of the Key-Value Cache (KV Cache). In this paper, we introduce HieraSparse, a hierarchical KV Cache compression framework with acceleration kernels that leverage GPU sparse tensor cores to speed up semi-structured KV Cache attention for both the prefill and decode phases. With the hierarchical design, our method allows for a flexible quality-sparsity trade-off and successfully converts sparsity into efficiency. Compared to the state-of-the-art decode method that utilizes unstructured sparsity, HieraSparse achieves $1.2 \times$ KV compression ratio and $4.57 \times$ attention speedup at the same sparsity level. Furthermore, we extended the semi-structured KV Cache pruning to the prefill stage,…
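To make "semi-structured sparsity" concrete, here is a minimal sketch of the 2:4 pattern that NVIDIA sparse tensor cores accelerate: in every group of four consecutive values, at most two are nonzero. The function name `prune_2_of_4` and the magnitude-based selection rule are assumptions for illustration only. The abstract does not state which pruning criterion HieraSparse uses, and this sketch makes no attempt to reproduce its hierarchical design or its kernels.

```python
from typing import List


def prune_2_of_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in each group of four.

    Hypothetical helper: magnitude-based selection is an assumption,
    not the criterion used by the paper.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")

    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        # Keep those two values, replace the other two with zero.
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# One row of a key (or value) cache, two groups of four.
row = [0.1, -2.0, 0.3, 1.5, 4.0, 0.0, -0.2, 0.05]
print(prune_2_of_4(row))  # [0.0, -2.0, 0.0, 1.5, 4.0, 0.0, -0.2, 0.0]
```

Because the positions of the surviving values are constrained to a fixed pattern, hardware can store only the kept values plus small position indices and skip the zeros during matrix multiplication, which unstructured sparsity does not allow as directly.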

Code & Models

Repositories

psl-ntu/HieraSparse (GitHub)