Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference

Donghyeon Joo; Helya Hosseini; Ramyad Hadidi; Bahar Asgari

arXiv:2505.22913 · cs.LG · November 7, 2025

TL;DR

This paper introduces unstructured sparsity techniques for KV cache pruning in large language model inference, achieving high compression rates and faster decoding without accuracy loss.

Contribution

It presents a novel unstructured sparsity method with a custom sparse attention kernel that improves KV cache compression and decoding speed in LLMs.

Findings

01 Achieves up to 70% sparsity without accuracy loss.

02 Compresses KV cache to 45% of dense size, enabling longer contexts.

03 Increases decoding throughput by up to 2.23x.

Abstract

We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance due to high memory overhead for large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns,…
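
The two ideas named in the abstract, per-token magnitude-based pruning and a bitmap-based sparse format, can be illustrated with a small CPU-only sketch. This is not the authors' kernel or the API of the linked repository: the function names `prune_token`, `compress`, and `sparse_dot` are hypothetical, and using one bitmap per token vector is an assumption, since the text above does not specify the tile size or memory layout.

```python
from typing import List, Tuple


def prune_token(vec: List[float], sparsity: float) -> List[float]:
    """Zero the smallest-magnitude entries of a single token's vector."""
    n_drop = round(len(vec) * sparsity)  # how many entries to zero out
    # Indices sorted by ascending magnitude; the first n_drop are pruned.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else v for i, v in enumerate(vec)]


def compress(vec: List[float]) -> Tuple[int, List[float]]:
    """Pack a pruned vector into (bitmap, nonzero values).

    Bit i of the bitmap is set when position i holds a nonzero value.
    (Assumed layout for illustration only.)
    """
    bitmap = 0
    values: List[float] = []
    for i, v in enumerate(vec):
        if v != 0.0:
            bitmap |= 1 << i
            values.append(v)
    return bitmap, values


def sparse_dot(query: List[float], bitmap: int, values: List[float]) -> float:
    """Dot product of a dense query with a compressed vector, no decompression."""
    acc = 0.0
    k = 0  # cursor into the packed nonzero values
    for i, q in enumerate(query):
        if (bitmap >> i) & 1:
            acc += q * values[k]
            k += 1
    return acc


# One token's key vector, pruned to 70% sparsity (3 of 10 entries survive).
key = [0.1, -2.0, 0.05, 0.7, -0.2, 3.0, 0.02, -0.4, 0.3, 1.5]
pruned = prune_token(key, 0.7)
bitmap, values = compress(pruned)
score = sparse_dot([1.0] * len(key), bitmap, values)
```

Because the bitmap records arbitrary nonzero positions, the pruning pattern can differ for every token, which is what distinguishes unstructured from structured sparsity. A GPU kernel would apply the same bookkeeping in parallel rather than with a Python loop.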

Code & Models

Repositories

dhjoo98/mustafar
PyTorch · Official

Videos

MUSTAFAR: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference · SlidesLive

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization

Methods: Softmax · Attention Is All You Need · Pruning