RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models
Bailin Wang, Chang Lan, Chong Wang, Ruoming Pang

TL;DR
This paper introduces RATTENTION, a local-global attention variant that uses a specialized linear attention mechanism to effectively reduce window size while maintaining performance, improving efficiency especially in short-context scenarios.
Contribution
We propose RATTENTION, a novel local-global attention method that captures out-of-window tokens, enabling smaller window sizes without performance loss, and demonstrate its effectiveness at large scales.
Findings
RATTENTION with a window size of 512 matches full-attention performance.
It achieves a better efficiency-performance tradeoff at the 3B and 12B scales.
It maintains training speed comparable to existing methods.
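The window-size tradeoff behind these findings can be illustrated with some rough arithmetic. The sketch below assumes that, at decode time, a sliding-window layer caches keys and values for at most `window` tokens while a full-attention layer caches the whole context. The function names are hypothetical, and the sketch ignores any fixed-size state that a linear attention component would add.

```python
from typing import Optional


def cached_tokens(context_len: int, window: Optional[int]) -> int:
    """Tokens whose keys/values one attention layer keeps at decode time.

    window=None stands for a full (global) attention layer.
    """
    if window is None:
        return context_len
    return min(context_len, window)


def local_layer_saving(context_len: int, window: int) -> float:
    """Fraction of per-layer KV cache saved relative to full attention."""
    full = cached_tokens(context_len, None)
    local = cached_tokens(context_len, window)
    return 1.0 - local / full


# Short-context example: a 2048-token context.
print(local_layer_saving(2048, 4096))  # window larger than the context
print(local_layer_saving(2048, 512))   # window smaller than the context
```

Under these assumptions, a 4096-token window saves nothing on a 2048-token context because the window never fills, whereas a 512-token window cuts the cache of each local layer by three quarters.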
Abstract
Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant…
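The idea of pairing a small window with a linear attention path for out-of-window tokens can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the feature map `phi` (elu(x) + 1), the normalization, and the function name `local_plus_linear_attention` are all choices made for this example. How the two outputs are merged is left open, so the sketch returns them separately.

```python
import math


def phi(x):
    # Positive feature map elu(x) + 1; an assumption made for this sketch.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_plus_linear_attention(q, k, v, window):
    """Causal attention over lists of T vectors.

    Returns (local_out, residual_out): softmax attention over the last
    `window` tokens, and linear attention over all older tokens.
    """
    T, d_k, d_v = len(q), len(k[0]), len(v[0])
    scale = 1.0 / math.sqrt(d_k)
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d_k                        # running sum of phi(k_j)
    local_out, residual_out = [], []
    for t in range(T):
        start = max(0, t - window + 1)
        if start > 0:
            # Token start-1 has just left the window: fold it into the state.
            j = start - 1
            fk = phi(k[j])
            for a in range(d_k):
                z[a] += fk[a]
                for b in range(d_v):
                    S[a][b] += fk[a] * v[j][b]
        # Numerically stable softmax attention over the in-window tokens.
        scores = [scale * dot(q[t], k[j]) for j in range(start, t + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        local_out.append(
            [sum(w[i] * v[start + i][b] for i in range(len(w))) / total
             for b in range(d_v)]
        )
        # Linear attention read-out from the constant-size state.
        fq = phi(q[t])
        denom = dot(fq, z)
        if denom > 0:
            residual_out.append(
                [sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                 for b in range(d_v)]
            )
        else:
            residual_out.append([0.0] * d_v)  # nothing outside the window yet
    return local_out, residual_out
```

The state `S` and `z` have a fixed size regardless of sequence length, which is why a path of this kind can summarize tokens that plain local attention discards without growing the cache.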
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Data Compression Techniques
