Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths

Tianyu Fu; Haofeng Huang; Xuefei Ning; Genghan Zhang; Boju Chen; Tianqi Wu; Hongyi Wang; Zixiao Huang; Shiyao Li; Shengen Yan; Guohao Dai; Huazhong Yang; Yu Wang

arXiv:2406.14909 · cs.LG · November 26, 2025 · 2 cites

TL;DR

The paper introduces MoA, a method that automatically tailors sliding-window attention spans in LLMs, significantly improving efficiency and accuracy in long-context scenarios by customizing attention per head and layer.

Contribution

MoA is the first approach to optimize heterogeneous attention spans across heads and layers, enhancing long-context performance and efficiency in LLM inference.

Findings

01 Increases effective context length by 3.9x with the same window size.

02 Boosts retrieval accuracy by 1.5-7.1x over the uniform-window baseline.

03 Reduces GPU memory usage by 1.2-1.4x and improves throughput by 6.6-8.2x.

Abstract

Sliding-window attention offers a hardware-efficient solution to the memory and throughput challenges of Large Language Models (LLMs) in long-context scenarios. Existing methods typically employ a single window length across all attention heads and input sizes. However, this uniform approach fails to capture the heterogeneous attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose *Mixture of Attention Spans* (MoA), which automatically tailors distinct sliding-window length configurations to different heads and layers. MoA constructs and navigates a search space of various window lengths and their scaling rules relative to input sizes. It profiles the model, evaluates potential configurations, and pinpoints the optimal length configurations for each head. MoA adapts to varying input sizes, revealing that some…
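The idea of per-head window lengths that scale with input size can be illustrated with a minimal sketch. It is not the authors' implementation: the linear rule `alpha + beta * n`, the function names, and the example `(alpha, beta)` values are all assumptions chosen for illustration. The abstract states only that the window lengths and their scaling rules are searched per head.

```python
def window_length(alpha, beta, n):
    """Hypothetical scaling rule (an assumption): span = alpha + beta * n, clamped to [1, n]."""
    return max(1, min(n, int(alpha + beta * n)))


def sliding_window_mask(n, window):
    """Causal sliding-window mask: query i attends to keys j with i - window < j <= i."""
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]


def per_head_masks(head_rules, n):
    """Build one boolean mask per head from that head's (alpha, beta) rule."""
    return [sliding_window_mask(n, window_length(a, b, n)) for a, b in head_rules]


def cached_tokens(head_rules, n):
    """Key/value positions each head needs to keep when decoding at input length n."""
    return [window_length(a, b, n) for a, b in head_rules]


# Illustrative values only: a fixed local head, a head whose span grows
# with the input, and a head that keeps full causal attention.
head_rules = [(4, 0.0), (2, 0.25), (0, 1.0)]
n = 16
spans = cached_tokens(head_rules, n)
masks = per_head_masks(head_rules, n)
```

In this toy setup the three heads keep different numbers of key/value positions at the same input length, which is where the memory saving of a heterogeneous configuration over full attention would come from. A uniform-window baseline corresponds to giving every head the same rule.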

Code & Models

Repositories

thu-nics/moa (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need · Focus