SALS: Sparse Attention in Latent Space for KV cache Compression
Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, Yidong Li

TL;DR
SALS compresses the Key-Value (KV) cache of large language models by projecting it into a low-rank latent space and performing sparse token selection there, reducing memory and compute requirements while maintaining accuracy.
Contribution
The paper proposes the Sparse Attention in Latent Space (SALS) framework, enabling effective KV cache compression in LLMs with minimal accuracy loss and improved inference speed.
Findings
Achieves 6.4-fold KV cache compression and 5.7-fold speed-up over FlashAttention2.
Demonstrates 1.4-fold and 4.5-fold throughput improvements over GPT-fast on different sequence lengths.
Maintains competitive accuracy across various large-scale models and benchmarks.
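To put the 6.4-fold compression figure in context, here is a back-of-the-envelope sketch of KV cache size. The model dimensions (32 layers, 32 heads, head dimension 128, 16-bit values, 32K-token context) are hypothetical values chosen for illustration and are not taken from the paper. Only the 6.4 ratio comes from the findings above.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_value=2):
    """Size of an uncompressed KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value

# Hypothetical model configuration (an assumption, not from the paper).
full = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=32768)

# Apply the reported 6.4-fold compression ratio to that hypothetical cache.
compressed = full / 6.4

GIB = 2 ** 30
print(f"uncompressed: {full / GIB:.1f} GiB")
print(f"compressed:   {compressed / GIB:.1f} GiB")
```

Under these assumed dimensions the cache shrinks from 16 GiB to about 2.5 GiB per sequence, which is the kind of saving that matters for long-context inference.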
Abstract
Large Language Models capable of handling extended contexts are in high demand, yet their inference remains challenging due to substantial Key-Value (KV) cache size and high memory bandwidth requirements. Previous research has demonstrated that the KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, due to the widely adopted Rotary Position Embedding (RoPE) mechanism in modern LLMs, naive low-rank compression either suffers severe accuracy degradation or creates a new speed bottleneck, as the low-rank cache must first be reconstructed in order to apply RoPE. In this paper, we introduce two key insights: first, the application of RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation…
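The general idea of scoring tokens in a latent space and reconstructing only the selected ones can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the projection `basis` is a hand-picked orthonormal matrix (in practice it would be learned or derived from the data), RoPE is omitted entirely, and the function names `project`, `reconstruct`, and `select_top_k` are hypothetical.

```python
def project(vec, basis):
    """Map a d-dim vector to an r-dim latent vector using r basis rows (r x d)."""
    return [sum(b * v for b, v in zip(row, vec)) for row in basis]

def reconstruct(latent, basis):
    """Map an r-dim latent vector back to d dims (basis transposed times latent)."""
    d = len(basis[0])
    return [sum(basis[i][j] * latent[i] for i in range(len(basis))) for j in range(d)]

def select_top_k(query, latent_keys, basis, k):
    """Score all cached tokens in latent space; return the k best indices in position order."""
    q_lat = project(query, basis)
    scores = [sum(q * x for q, x in zip(q_lat, key)) for key in latent_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Assumed toy setup: hidden size d = 4, latent size r = 2.
basis = [[1, 0, 0, 0],
         [0, 1, 0, 0]]
keys = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [-1, 0, 0, 0],
        [0.9, 0.1, 0, 0]]

# Only the r-dim latent keys are cached, not the full d-dim keys.
latent_keys = [project(key, basis) for key in keys]

query = [1, 0, 0, 0]
selected = select_top_k(query, latent_keys, basis, k=2)

# Only the selected tokens are expanded back to the full hidden dimension.
selected_keys = [reconstruct(latent_keys[i], basis) for i in selected]
print(selected, selected_keys)
```

In this toy case the keys lie exactly in the span of `basis`, so reconstruction is lossless. With real data the projection would be approximate, and the cost saving comes from reconstructing only `k` tokens rather than the whole cache.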
