MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim

TL;DR
MoSKA introduces a Shared KV Attention mechanism, backed by disaggregated and specialized hardware, to make long-sequence LLM inference more efficient and scalable, particularly when many concurrent requests share the same context.
Contribution
The paper proposes MoSKA, an architecture that combines Shared KV Attention, an MoE-inspired sparse attention strategy that prunes the search space, and a Disaggregated Infrastructure that specializes hardware for unique and shared data.
Findings
Achieves up to 538.7x throughput increase over baselines.
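Targets workloads with high context sharing, where many requests reuse the same sequences.
Transforms attention over shared data from a series of memory-bound GEMV operations into a single compute-bound GEMM by batching concurrent requests.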
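The GEMV-to-GEMM point can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal, single-head example in pure Python with no masking, and the function names (attend_one, attend_batched) and the toy data are hypothetical. It assumes every request attends over the same shared K and V. Handling requests one at a time computes a matrix-vector product per request, so the shared K is read once per request; stacking the queries into a matrix Q computes one matrix-matrix product, so K is read once for the whole batch.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    # Plain matrix-matrix product (GEMM) on lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attend_one(q, K, V):
    # One request: K times q is a matrix-vector product (GEMV).
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    w = softmax(scores)
    return [sum(w[t] * V[t][j] for t in range(len(V))) for j in range(len(V[0]))]

def attend_batched(Q, K, V):
    # All requests together: Q times K^T is a single GEMM over the shared K.
    scale = 1.0 / math.sqrt(len(Q[0]))
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([scale * s for s in row]) for row in scores]
    return matmul(weights, V)

# Toy shared context (3 tokens, head dimension 2) and 3 concurrent queries.
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Q = [[0.5, -0.5], [1.0, 2.0], [0.0, 0.0]]

per_request = [attend_one(q, K, V) for q in Q]  # one GEMV per request
batched = attend_batched(Q, K, V)               # one GEMM for the batch
```

Both paths produce the same attention outputs; only the shape of the computation changes. The batched form reuses each loaded element of K across all queries, which is what raises arithmetic intensity and moves the operation toward the compute-bound regime.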
Abstract
The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a…
Taxonomy
Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Big Data and Digital Economy
