SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
Santhosh G S, Saurav Prakash, Balaraman Ravindran

TL;DR
SWAN is a novel, decompression-free KV-cache compression method for LLMs that significantly reduces memory usage during inference while maintaining near-baseline performance, offering runtime-tunable compression levels.
Contribution
SWAN introduces a fine-tuning-free approach that rotates the KV-cache with an offline orthogonal matrix and prunes it, so attention runs directly on the compressed cache with no decompression overhead and with memory savings that can be tuned at runtime.
Findings
Maintains performance close to the uncompressed baseline at 50-60% per-token KV-cache memory savings when augmented with a small dense buffer.
Offers runtime-tunable compression levels for flexible memory management.
Outperforms existing compression methods in efficiency and flexibility.
Abstract
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is…
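The decompression-free claim rests on a standard identity: an orthogonal matrix preserves inner products. As a sketch, assume an orthogonal matrix $R$ (so $R^\top R = I$) is applied to both the query $q$ and a cached key $k$. The symbols $R$, $q$, $k$, $m$ and the operator $\mathrm{top}_m$ (keep the $m$ largest-magnitude entries, zero the rest) are notation introduced here for illustration, not taken from the paper.

$$
q^\top k \;=\; q^\top R^\top R\, k \;=\; (Rq)^\top (Rk) \;\approx\; (Rq)^\top\, \mathrm{top}_m(Rk)
$$

The right-hand side is evaluated directly on the pruned vector, so the original key never has to be reconstructed. The approximation is exact when $m$ equals the head dimension; otherwise, by the Cauchy-Schwarz inequality, the error is at most $\lVert q \rVert$ times the norm of the entries of $Rk$ that were dropped.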
Peer Reviews
Decision: Submitted to ICLR 2026
- Eliminates reconstruction overhead by performing attention directly on compressed KV caches.
- Combines a sparse historical cache with a small dense buffer for recent tokens, effectively preserving accuracy.
- While theoretical compute savings are analyzed, the paper does not provide concrete wall-clock latency or throughput comparisons on modern GPU kernels (e.g., FlashAttention or Triton baselines), leaving practical efficiency uncertain.
- The claimed compute benefits rely on sparse-dense matvec operations, but these are often inefficient on current GPU hardware; implementation feasibility and actual speedups are not validated.
- Applying the orthogonal projection to queries and keys at each deco
* The paper is clearly written and well-structured, making it easy to follow.
* It introduces an interesting compression approach that converts the KV-cache into a hybrid sparse–dense representation and performs attention computations directly on the compressed cache without decompression.
* The accuracy evaluation is comprehensive, covering a diverse range of tasks, from mathematical reasoning to commonsense understanding and long-context processing, which demonstrates the method's generality.
* The paper lacks a solid system-level implementation to substantiate its claimed efficiency. The computational savings are analyzed only theoretically, without validation through real runtime measurements. Since the method depends on storing pruned tensors in a sparse (CSR) format, which is typically inefficient unless sparsity is extremely high (>99%), it is unclear whether the reported compression ratios (30–50%), where accuracy is largely preserved, actually yield any practical speedup.
* The
+ **Decompression-free design:** SWAN allows attention to run directly on a sparse cache, removing the need for reconstruction or merging operations that typically introduce overhead in low-rank or codec approaches.
+ **Clear, implementable mechanism:** Algorithm 1 precisely specifies runtime steps (project, buffer, prune-to-top-k, append to sparse cache, then hybrid attention), and Fig. 1 clarifies the data path.
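The runtime steps named in the review above (project, buffer, prune-to-top-k, append to sparse cache, hybrid attention) can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the Householder reflection is only a stand-in for SWAN's offline orthogonal matrix, the `{index: value}` dictionaries stand in for whatever sparse format the authors use, values are assumed to be rotated and pruned the same way as keys, and every name here (`householder`, `prune_top_k`, `HybridCache`, `keep`, `buffer_size`) is hypothetical.

```python
import math


def householder(v):
    """Orthogonal matrix I - 2 v v^T / (v^T v); a stand-in for the offline matrix."""
    n = len(v)
    norm_sq = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(n)] for i in range(n)]


def matvec(m, x):
    """Dense matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def prune_top_k(x, keep):
    """Keep the `keep` largest-magnitude entries as a sparse {index: value} dict."""
    kept = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:keep]
    return {i: x[i] for i in kept}


class HybridCache:
    """Dense buffer for recent tokens plus a pruned cache for older ones (assumed layout)."""

    def __init__(self, rotation, buffer_size, keep):
        self.rotation = rotation
        self.rotation_t = [list(col) for col in zip(*rotation)]
        self.buffer_size = buffer_size
        self.keep = keep      # may be changed between steps
        self.dense = []       # recent (key, value) pairs: rotated, unpruned
        self.sparse = []      # older (key, value) pairs: rotated, pruned

    def append(self, key, value):
        # Project the new token, then buffer it densely.
        self.dense.append((matvec(self.rotation, key), matvec(self.rotation, value)))
        if len(self.dense) > self.buffer_size:
            # Oldest buffered token is pruned and moved to the sparse cache.
            old_key, old_value = self.dense.pop(0)
            self.sparse.append((prune_top_k(old_key, self.keep),
                                prune_top_k(old_value, self.keep)))

    def attend(self, query):
        if not self.dense and not self.sparse:
            raise ValueError("cache is empty")
        q = matvec(self.rotation, query)
        scale = math.sqrt(len(q))
        # Sparse scores touch only the kept coordinates; nothing is reconstructed.
        scores = [sum(v * q[i] for i, v in key.items()) / scale
                  for key, _ in self.sparse]
        scores += [sum(a * b for a, b in zip(key, q)) / scale
                   for key, _ in self.dense]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [0.0] * len(q)
        n_sparse = len(self.sparse)
        for w, (_, value) in zip(weights[:n_sparse], self.sparse):
            for i, v in value.items():
                out[i] += w * v / total
        for w, (_, value) in zip(weights[n_sparse:], self.dense):
            for i, v in enumerate(value):
                out[i] += w * v / total
        # Assumption: the output is rotated back once per step; the cache is never expanded.
        return matvec(self.rotation_t, out)
```

With `keep` equal to the head dimension the sketch reproduces ordinary softmax attention, which makes it easy to sanity-check. Lowering `keep` shrinks every entry that later moves into the sparse cache, which is one plausible reading of the runtime-tunable compression the summary mentions. Whether such a layout is fast on real GPU kernels is exactly what the reviewers ask the authors to measure.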
+ **No latency or throughput evaluation:** Although a theoretical efficiency analysis is provided, no empirical runtime measurements are presented. Wall-clock latency, throughput, or per-step breakdowns (prefill vs. decode) are missing, making it unclear how much real-world speedup SWAN achieves.
+ **No baseline comparisons:** The paper does not compare against any prior baseline approach. In particular, recent hidden-dimension compression methods such as Palu (low-rank) and EigenAttention
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Topic Modeling
