Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
Donghyeon Joo, Helya Hosseini, Ramyad Hadidi, Bahar Asgari

TL;DR
This paper applies unstructured sparsity to KV cache pruning in large language model inference, reaching up to 70% sparsity without accuracy loss and up to 2.23x higher decoding throughput.
Contribution
It presents a novel unstructured sparsity method with a custom sparse attention kernel that improves KV cache compression and decoding speed in LLMs.
Findings
Achieves up to 70% sparsity without accuracy loss.
Compresses KV cache to 45% of dense size, enabling longer contexts.
Increases decoding throughput by up to 2.23x.
Abstract
We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance due to high memory overhead for large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns,…
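The two ideas named in the abstract, per-token magnitude-based pruning and a bitmap-based sparse format, can be illustrated with a small NumPy sketch. This is not the authors' GPU kernel: the function names (`prune_per_token`, `pack_bitmap`, `unpack_bitmap`), the one-bit-per-element row-wise bitmap layout, and the row-major ordering of the stored values are all assumptions made for illustration. The sketch assumes a cache laid out as `(tokens, head_dim)` and a sparsity in the range [0, 1).

```python
import numpy as np

def prune_per_token(cache, sparsity):
    """Keep the largest-magnitude entries of each token vector (row); zero the rest.

    Hypothetical helper: `cache` is (tokens, head_dim), `sparsity` is in [0, 1).
    """
    dim = cache.shape[1]
    keep = dim - int(round(sparsity * dim))  # entries kept per token
    # Column indices of the `keep` largest |x| in every row.
    top = np.argsort(-np.abs(cache), axis=1)[:, :keep]
    mask = np.zeros(cache.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return np.where(mask, cache, np.zeros_like(cache)), mask

def pack_bitmap(pruned, mask):
    """Assumed layout: 1 bit per element (row-wise) plus kept values in row-major order."""
    bitmap = np.packbits(mask, axis=1)
    values = pruned[mask]
    return bitmap, values

def unpack_bitmap(bitmap, values, shape):
    """Rebuild the dense pruned cache from the bitmap and the kept values."""
    mask = np.unpackbits(bitmap, axis=1, count=shape[1]).astype(bool)
    dense = np.zeros(shape, dtype=values.dtype)
    dense[mask] = values
    return dense

# Usage: prune a random fp16 "Key cache" of 8 tokens x 64 dims to 70% sparsity.
rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 64)).astype(np.float16)
pruned, mask = prune_per_token(keys, 0.7)
bitmap, values = pack_bitmap(pruned, mask)
restored = unpack_bitmap(bitmap, values, keys.shape)
```

Because the mask is chosen independently per token, the kept positions follow no fixed pattern, which is why a format that records arbitrary positions (here, a bitmap) is needed. A real kernel would compute attention directly on the compressed form rather than rebuilding the dense matrix as `unpack_bitmap` does.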
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization
Methods: Softmax · Attention Is All You Need · Pruning
