Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

Zihan Zhao; Baotong Lu; Shengjie Lin; Yizou Chen; Jing Liu; Yanqi Zhang; Ziming Miao; Ming-Chang Yang; Haiying Shen; Qi Chen; Fan Yang

arXiv:2604.26837·cs.LG·April 30, 2026


TL;DR

SPIN is an inference framework that unifies dynamic sparse attention with hierarchical (GPU-CPU) KV storage to make long-context LLM serving more efficient.

Contribution

SPIN co-designs the execution pipeline with hierarchical KV storage, enabling scalable, high-throughput long-context LLM inference.

Findings

1. SPIN achieves 1.66-5.66x higher throughput than vLLM.

2. SPIN reduces time-to-first-token by 7-9x.

3. SPIN cuts total cost of processing by up to 58%.

Abstract

Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different…
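To make the idea of "accessing only a small, query-dependent subset of the KV state per decoding step" concrete, here is a minimal pure-Python sketch of one decode step. It is not SPIN's implementation. The fixed-size partitions, the mean-key scoring rule, and the names `select_partitions` and `sparse_attention` are all assumptions chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_partitions(query, keys, part_size, top_k):
    """Pick the top_k partitions whose mean key scores highest against the query.

    Assumption: partitions are contiguous, fixed-size token ranges and are
    summarized by their mean key. Real sparse-attention methods differ here.
    """
    dim = len(query)
    scored = []
    for start in range(0, len(keys), part_size):
        block = keys[start:start + part_size]
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        scored.append((dot(query, mean_key), start // part_size))
    scored.sort(reverse=True)
    return sorted(idx for _, idx in scored[:top_k])

def sparse_attention(query, keys, values, part_size, top_k):
    """Softmax attention restricted to the tokens of the selected partitions."""
    parts = select_partitions(query, keys, part_size, top_k)
    token_ids = [
        t
        for p in parts
        for t in range(p * part_size, min((p + 1) * part_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[t]) * scale for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]
    return out, token_ids

# Toy example: 8 cached tokens in 4 partitions of 2 tokens each.
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[float(i)] for i in range(len(keys))]
query = [1.0, 0.0]

out, touched = sparse_attention(query, keys, values, part_size=2, top_k=2)
print(touched)  # [2, 3, 6, 7]: only half of the cached tokens are read
```

In a hierarchical setup, the entries in `touched` are the ones that would have to be resident on, or fetched to, the GPU for this step. Because the selection changes with every query, the resulting transfers are fine-grained and irregular, which is the GPU-CPU retrieval bottleneck the abstract describes.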
