S2O: Early Stopping for Sparse Attention via Online Permutation

Yu Zhang; Songwei Liu; Chenqian Yan; Sheng Lin; Beichen Ning; Fangmin Chen; Xing Wang

arXiv:2602.22575 · cs.LG · May 6, 2026

TL;DR

S2O introduces online permutation with early stopping for sparse attention, raising both the efficiency and the attainable sparsity of long-context inference while preserving accuracy.

Contribution

S2O proposes an importance-guided online permutation combined with early stopping, which improves sparse-attention efficiency beyond what existing block-based approaches achieve.
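A toy calculation can illustrate why coarse contiguous blocks limit sparsity and why reordering tokens helps. This sketch is not from the paper: it assumes a single importance score per key token (real attention importance also depends on the query), and the helper `blocks_to_cover` is hypothetical. It counts how many fixed-size blocks must be computed to retain a given share of the total importance. When important tokens are scattered, every contiguous block holds a little of the mass; sorting tokens by importance first packs that mass into very few blocks.

```python
import numpy as np

def blocks_to_cover(scores: np.ndarray, block: int, mass: float) -> int:
    """Fewest blocks of `block` consecutive tokens (heaviest first) whose
    summed importance reaches a `mass` fraction of the total (hypothetical helper)."""
    n_blocks, rem = divmod(len(scores), block)
    assert rem == 0, "sequence length must be a multiple of the block size"
    block_mass = scores.reshape(n_blocks, block).sum(axis=1)
    # Cumulative share of total importance, taking the heaviest blocks first.
    cum = np.cumsum(np.sort(block_mass)[::-1]) / scores.sum()
    # First position where the cumulative share reaches `mass`, as a block count.
    return min(int(np.searchsorted(cum, mass)) + 1, n_blocks)

n, block = 1024, 64
scores = np.full(n, 1e-3)   # background tokens with small importance
scores[::16] = 1.0          # 64 important tokens scattered through the sequence

# Original order: each 64-token block holds 4 important tokens, so mass is spread evenly.
print(blocks_to_cover(scores, block, 0.9))          # → 15

# Importance-sorted order (a global sort, used here only for illustration).
perm = np.argsort(-scores, kind="stable")
print(blocks_to_cover(scores[perm], block, 0.9))    # → 1
```

The full sort only makes the effect visible; S2O itself is described as using a lightweight online, index-guided loading policy rather than an explicit permutation of the sequence.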

Findings

01. Reduces single-operator MSE by 3.82× at matched sparsity.
02. Achieves a 7.51× attention speedup and a 3.81× end-to-end speedup.
03. Decreases prefill compute density by 3.31× at matched MSE.
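Amdahl's law can relate the two reported speedups, provided both come from the same workload and only the attention operator is accelerated. The summary does not confirm either assumption. Under them, the fraction f of baseline end-to-end time spent in attention follows from:

```latex
S_{\text{e2e}} = \frac{1}{(1 - f) + f / S_{\text{attn}}}
\;\Longrightarrow\;
f = \frac{1 - 1/S_{\text{e2e}}}{1 - 1/S_{\text{attn}}}
  = \frac{1 - 1/3.81}{1 - 1/7.51} \approx 0.85
```

On these assumptions, roughly 85% of baseline time would be attention. The unaccelerated remainder would then cap end-to-end gains at about 1/(1 - f) ≈ 6.7×, however fast attention becomes.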

Abstract

Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present S2O, which performs early stopping for sparse attention via online permutation. Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks.…
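The idea of reading tokens through an index map inside a FlashAttention-style loop can be sketched in NumPy. This is a hypothetical illustration, not the authors' kernel. The importance proxy (keys scored against the mean query), the block size `block`, and the stopping threshold `tau` on the proxy's cumulative mass are all assumptions, because the abstract is truncated before it describes S2O's actual early-stopping criterion. The table `perm` plays the role of the virtual-to-physical mapping: virtual slot `i` reads physical token `perm[i]`. Each block therefore gathers non-contiguous tokens, while the online-softmax state (running max, denominator, accumulator) is updated as in FlashAttention.

```python
import numpy as np

def dense_attention(q, k, v):
    """Reference softmax(q k^T / sqrt(d)) v."""
    s = (q @ k.T) / np.sqrt(k.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def permuted_early_stop_attention(q, k, v, block=64, tau=0.95):
    """Hypothetical sketch: visit key/value blocks in proxy-importance order via
    an index table, merge them with an online softmax, and stop early."""
    n, d = k.shape
    scale = 1.0 / np.sqrt(d)

    # Assumed importance proxy: score every key against the mean query.
    proxy = (k @ q.mean(axis=0)) * scale
    proxy_p = np.exp(proxy - proxy.max())
    proxy_p /= proxy_p.sum()

    # Virtual-to-physical table: virtual slot i reads physical token perm[i].
    perm = np.argsort(-proxy_p, kind="stable")

    m_run = np.full(q.shape[0], -np.inf)      # running row-wise max score
    l_run = np.zeros(q.shape[0])              # running softmax denominator
    acc = np.zeros((q.shape[0], v.shape[1]))  # running unnormalized output
    covered, visited = 0.0, 0
    for start in range(0, n, block):
        idx = perm[start:start + block]       # non-contiguous physical tokens
        s = (q @ k[idx].T) * scale            # scores for this virtual block
        m_new = np.maximum(m_run, s.max(axis=1))
        alpha = np.exp(m_run - m_new)         # rescale old state to the new max
        p = np.exp(s - m_new[:, None])
        l_run = alpha * l_run + p.sum(axis=1)
        acc = alpha[:, None] * acc + p @ v[idx]
        m_run = m_new
        visited += 1
        covered += proxy_p[idx].sum()
        if covered >= tau:                    # early stop: skip the remaining tail
            break
    return acc / l_run[:, None], visited

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 32))
k = rng.standard_normal((512, 32))
v = rng.standard_normal((512, 32))
out_full, n_full = permuted_early_stop_attention(q, k, v, tau=1.0)
out_fast, n_fast = permuted_early_stop_attention(q, k, v, tau=0.5)
print(np.allclose(out_full, dense_attention(q, k, v)), n_full, n_fast)
```

With `tau=1.0` every block is visited, and the result should match dense attention up to floating-point error. A smaller `tau` skips the tail of the permuted order. A GPU kernel would perform the gather and the online update per query tile in on-chip memory; NumPy is used here only to make the bookkeeping explicit.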
