$\pi$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

Dong Liu; Yanxuan Yu

arXiv:2511.10696 · cs.CL · March 31, 2026

TL;DR

The paper introduces $\pi$-Attention, a periodic sparse Transformer that models long contexts efficiently by combining ring-local neighborhoods, deterministic $\pi$-stride skips, and an adaptive fusion gate, achieving comparable or better performance at reduced computational cost.

Contribution

It proposes a novel periodic sparse attention mechanism that improves long-range modeling efficiency and effectiveness over existing sparse attention methods.

Findings

01

$\pi$-Attention uses 50% fewer GPUs while delivering similar or better performance.

02

It attains 8.3% lower perplexity than RingAttention in language modeling.

03

The model's design enables predictable coverage and adaptive fusion for long-context tasks.

Abstract

Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack of adaptability. We present $\pi$-Attention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $\pi$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that $\pi$-Attention achieves $O(kL + \pi \log L)$ receptive field growth compared to $O(kL)$ for RingAttention, where $k$ is the local window size, $\pi$ is the…
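To make the sparsity pattern described in the abstract concrete, here is a minimal sketch of a mask that combines a ring-local window with deterministic skips. It is an illustration under stated assumptions, not the authors' implementation: the function name `periodic_sparse_mask`, the wrap-around indexing, and the choice of exactly one skip at offsets `-stride` and `+stride` per query are all hypothetical, and the adaptive fusion gate is not modeled.

```python
def periodic_sparse_mask(seq_len, k, stride):
    """Return mask[i][j] == True if query i may attend to key j.

    Assumptions (not taken from the paper): positions wrap around like a
    ring, and each query gets one skip connection at -stride and +stride.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        # Ring-local neighborhood: k keys on each side, plus the query itself.
        for offset in range(-k, k + 1):
            mask[i][(i + offset) % seq_len] = True
        # Deterministic periodic skips to more distant positions.
        for offset in (-stride, stride):
            mask[i][(i + offset) % seq_len] = True
    return mask


mask = periodic_sparse_mask(seq_len=16, k=2, stride=5)
keys_per_query = [sum(row) for row in mask]
total_pairs = sum(keys_per_query)
print(keys_per_query[0], total_pairs)  # prints "7 112"
```

In this toy setting each query attends to 7 keys regardless of sequence length (5 local plus 2 skips), so the number of attended pairs is 112 rather than the 256 of dense attention, and it grows linearly with `seq_len` when `k` and `stride` are held fixed.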
