$\pi$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling
Dong Liu, Yanxuan Yu

TL;DR
The paper introduces $\pi$-Attention, a periodic sparse Transformer that models long contexts efficiently by combining ring-local neighborhoods, deterministic skips, and adaptive fusion, achieving comparable or better performance at reduced computational cost.
Contribution
It proposes a novel periodic sparse attention mechanism that improves long-range modeling efficiency and effectiveness over existing sparse attention methods.
Findings
$\pi$-Attention uses 50% fewer GPUs while delivering similar or better performance.
It attains 8.3% lower perplexity than RingAttention in language modeling.
The model's design enables predictable coverage and adaptive fusion for long-context tasks.
Abstract
Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and a lack of adaptability. We present $\pi$-Attention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $\pi$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that $\pi$-Attention achieves receptive field growth compared to for RingAttention, where is the local window size, is the…
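The combination of a local window with fixed-stride skips can be illustrated with a small, purely combinatorial sketch. This is not the authors' implementation: the function names `neighbors` and `receptive_field`, the choice of one skip hop of stride `pi` in each direction, the clipping at sequence boundaries (rather than ring wraparound), and the values of `L`, `k`, and `pi` are all assumptions made for illustration. The adaptive fusion gate is not modeled here; the sketch only shows how a constant per-token footprint keeps the per-layer cost linear in sequence length while skips widen the set of reachable tokens.

```python
def neighbors(i, L, k, pi=None):
    """Positions token i attends to in one layer (hypothetical layout)."""
    # Local window: positions within distance k, clipped to the sequence.
    nbrs = set(range(max(0, i - k), min(L, i + k + 1)))
    if pi is not None:
        # Assumed skip pattern: one hop of stride pi in each direction.
        for j in (i - pi, i + pi):
            if 0 <= j < L:
                nbrs.add(j)
    return nbrs


def receptive_field(i, L, k, pi, layers):
    """Positions that can influence token i after stacking `layers` layers."""
    reach = {i}
    for _ in range(layers):
        reach = set().union(*(neighbors(j, L, k, pi) for j in reach))
    return reach


# Illustrative values only; none of these come from the paper.
L, k, pi = 256, 2, 16

# Total attended pairs in one layer: bounded by (2k + 3) * L, i.e. linear in L.
edges = sum(len(neighbors(i, L, k, pi)) for i in range(L))

local_only = len(receptive_field(128, L, k, None, layers=3))
with_skips = len(receptive_field(128, L, k, pi, layers=3))
print(edges, local_only, with_skips)
```

With these illustrative values, three stacked layers reach 13 positions around token 128 using the local window alone and 43 positions once the stride-16 skips are added, while the number of attended pairs per layer stays below `(2 * k + 3) * L`.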
