SWAA: Sliding Window Attention Adaptation for Efficient and Quality Preserving Long Context Processing

Yijiong Yu; Jiale Liu; Qingyun Wu; Huazheng Wang; Ji Pei

arXiv:2512.10411 · cs.CL · March 27, 2026

TL;DR

SWAA is a set of adaptation techniques that lets Transformer models pretrained with full attention process long contexts efficiently using sliding window attention. It combines strategies that mitigate the structural and training-inference mismatches of naive sliding window attention, reporting 30% to 100% inference speedups while maintaining quality.

Contribution

The paper presents SWAA, a versatile toolkit that adapts full-attention models to sliding window attention without costly pretraining, improving long-context inference efficiency.

Findings

01. Achieves 30% to 100% speedups in long-context inference.

02. Effectively recovers long-context performance with specific strategy combinations.

03. Provides a flexible framework adaptable to various computational scenarios.

Abstract

The quadratic complexity of self-attention in Transformer-based LLMs renders long-context inference prohibitively expensive. While Sliding Window Attention (SWA), the simplest sparse attention pattern, offers a linear-complexity alternative, it suffers from catastrophic long-context performance collapse, which stems from two fundamental factors: the training-inference mismatch when naively applying SWA to models pretrained with Full Attention (FA), and the inherent structural inability to access distant information when applying SWA to every module at all times. To address these dual challenges, we propose Sliding Window Attention Adaptation (SWAA), a plug-and-play toolkit of recipes that adapts FA models to SWA without costly pretraining. SWAA systematically combines four core strategies to tackle these distinct issues: (1) Full Attention (FA) Decode and (2) Interleaving FA and SWA…
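The contrast the abstract draws between FA and SWA can be illustrated with attention masks. The following is a minimal sketch, not the authors' implementation: the function names `causal_mask`, `sliding_window_mask`, and `attended_pairs` are hypothetical, and the window convention (each query sees itself plus the `window - 1` preceding tokens) is an assumption made for illustration.

```python
def causal_mask(n):
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Causal sliding window (assumed convention): query i attends
    only to keys j with i - window < j <= i."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


n, window = 8, 3
fa = causal_mask(n)
swa = sliding_window_mask(n, window)

print(attended_pairs(fa))   # 36: grows as n * (n + 1) / 2, quadratic in n
print(attended_pairs(swa))  # 21: bounded by n * window, linear in n
print(swa[n - 1][0])        # False: the last token cannot see the first
```

The final line shows the structural limitation the abstract describes: once the sequence exceeds the window, a query has no direct access to distant tokens, regardless of how the model was trained.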

Code & Models

Models

yuyijiong/Qwen3-SWA-adaptation (Hugging Face model)
