Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
Zihan Zhao, Baotong Lu, Shengjie Lin, Yizou Chen, Jing Liu, Yanqi Zhang, Ziming Miao, Ming-Chang Yang, Haiying Shen, Qi Chen, Fan Yang

TL;DR
SPIN is an inference framework that unifies sparse attention techniques with hierarchical (GPU-CPU) KV storage to make long-context LLM serving more efficient.
Contribution
SPIN co-designs the execution pipeline with hierarchical KV storage, enabling scalable, high-throughput long-context LLM inference.
Findings
SPIN achieves 1.66-5.66x higher throughput than vLLM.
SPIN reduces time-to-first-token by 7-9x.
SPIN cuts total cost of processing by up to 58%.
Abstract
Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different…
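To make the idea of a "small, query-dependent subset of the KV state" concrete, here is a minimal pure-Python sketch of block-level sparse attention for a single decoding step. It is an illustration only, not SPIN's implementation: the function names (`dense_attention`, `select_blocks`, `sparse_attention`), the fixed block size, and the block-mean scoring heuristic are all assumptions chosen for brevity, and real sparse attention methods differ in how they pick the subset.

```python
import math

def dense_attention(q, keys, values):
    """Reference softmax attention of one query over all given keys."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def select_blocks(q, keys, block_size, top_k):
    """Pick the top_k fixed-size key blocks for this query.

    Assumed heuristic: score each block by the dot product of the query
    with the block's mean key. Returns block start offsets in ascending order.
    """
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean = [sum(col) / len(block) for col in zip(*block)]
        scored.append((sum(a * b for a, b in zip(q, mean)), start))
    scored.sort(reverse=True)
    return sorted(start for _, start in scored[:top_k])

def sparse_attention(q, keys, values, block_size, top_k):
    """Attend only over the keys/values inside the selected blocks."""
    idx = [i
           for start in select_blocks(q, keys, block_size, top_k)
           for i in range(start, min(start + block_size, len(keys)))]
    return dense_attention(q, [keys[i] for i in idx], [values[i] for i in idx])
```

In a hierarchical setup, the selected blocks are what would have to be fetched from CPU memory when they are not already resident on the GPU, which is why the granularity and regularity of the selection matter for transfer cost.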
