MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference

Myunghyun Rhee; Sookyung Choi; Euiseok Kim; Joonseop Sim; Youngpyo Joo; Hoshik Kim

arXiv:2511.06010·cs.LG·November 11, 2025

TL;DR

MoSKA batches concurrent requests over shared key-value (KV) data so that attention on that data runs as a single compute-bound GEMM, and pairs this with sparse attention and disaggregated hardware to improve the throughput and scalability of long-sequence LLM inference, especially when context sharing is high.

Contribution

The paper proposes MoSKA, an architecture that combines Shared KV Attention, an MoE-inspired sparse attention strategy that prunes the search space, and a disaggregated infrastructure that specializes hardware for unique and shared data.

Findings

01 Achieves up to 538.7x throughput increase over baselines.

02 Effectively handles high context sharing in LLM workloads.

03 Transforms shared data attention into compute-bound GEMM operations.

Abstract

The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a…
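
The GEMV-to-GEMM idea described in the abstract can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention over a small shared K and V, and all function names (`attend_per_request`, `attend_batched`) are hypothetical and not taken from the paper. It only shows that stacking the queries of concurrent requests yields the same outputs as handling each request separately; it does not model MoSKA's sparse routing or its disaggregated hardware.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m), as lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attend_per_request(q, K, V):
    """One query vector against the shared K and V: matrix-vector (GEMV-like) work."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k_row)) for k_row in K]
    weights = softmax(scores)
    return [sum(w * v_row[j] for w, v_row in zip(weights, V)) for j in range(len(V[0]))]


def attend_batched(Q, K, V):
    """All concurrent queries stacked into Q: matrix-matrix (GEMM-like) work."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    scores = matmul(Q, transpose(K))
    weights = [softmax([scale * s for s in row]) for row in scores]
    return matmul(weights, V)


# Assumed toy sizes: head dimension, shared-context length, concurrent requests.
random.seed(0)
d, n_shared, batch = 4, 6, 3
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_shared)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_shared)]
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(batch)]

per_request = [attend_per_request(q, K, V) for q in Q]
batched = attend_batched(Q, K, V)

# Largest element-wise gap between the two paths.
max_diff = max(
    abs(a - b) for ra, rb in zip(per_request, batched) for a, b in zip(ra, rb)
)
```

In the per-request path, K and V are traversed once for every request; in the batched path they are traversed once for the whole batch. Ignoring query and output traffic, the arithmetic performed per loaded element of K and V therefore grows roughly in proportion to the number of batched requests, which is the intuition behind moving from a memory-bound to a compute-bound regime.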

Taxonomy

Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Big Data and Digital Economy