S2O: Early Stopping for Sparse Attention via Online Permutation
Yu Zhang, Songwei Liu, Chenqian Yan, Sheng Lin, Beichen Ning, Fangmin Chen, Xing Wang

TL;DR
S2O adds online permutation and early stopping to sparse attention, raising the achievable sparsity in long-context inference and reaching a 7.51× attention speedup (3.81× end-to-end) without sacrificing accuracy.
Contribution
S2O proposes an importance-guided online permutation with early stopping that pushes sparse attention past the sparsity ceiling imposed by coarse, block-based approaches.
Findings
Reduces single-operator MSE by 3.82× at matched sparsity
Achieves 7.51× attention speedup and 3.81× end-to-end speedup
Decreases prefill compute density by 3.31× at matched MSE
Abstract
Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present S2O, which performs early stopping for sparse attention via online permutation. Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks.…
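The idea of loading tokens through an index list and stopping early can be illustrated with a toy, single-query sketch in plain Python. Everything here is an assumption made for illustration and not the authors' kernel: the helper names (`attention_dense`, `attention_permuted`), the use of the exact query-key score as an oracle importance measure, and the fixed block `budget` standing in for the stopping rule, which this excerpt does not describe.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_dense(q, keys, values):
    """Reference softmax attention for one query over all keys."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def attention_permuted(q, keys, values, order, block=2, budget=None):
    """Online-softmax attention that gathers tokens through `order`.

    `order` is a hypothetical index list (the "virtual-to-physical" map);
    `budget` is an assumed stopping rule: process only the first
    `budget` blocks, then stop.
    """
    scale = 1.0 / math.sqrt(len(q))
    blocks = [order[i:i + block] for i in range(0, len(order), block)]
    if budget is not None:
        blocks = blocks[:budget]  # early stop: skip remaining blocks
    running_max, denom = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for idx in blocks:
        # Gather non-contiguous tokens by index instead of a contiguous span.
        scores = [scale * dot(q, keys[i]) for i in idx]
        new_max = max(running_max, max(scores))
        corr = math.exp(running_max - new_max)  # rescale earlier partial sums
        denom *= corr
        acc = [a * corr for a in acc]
        for i, s in zip(idx, scores):
            p = math.exp(s - new_max)
            denom += p
            acc = [a + p * v for a, v in zip(acc, values[i])]
        running_max = new_max
    return [a / denom for a in acc]

# Toy data: tokens 1 and 4 dominate the attention weights.
q = [1.0, 0.0]
keys = [[0.1, 0.0], [3.0, 0.0], [-1.0, 0.0], [0.2, 0.0],
        [2.5, 0.0], [-0.5, 0.0], [0.0, 0.0], [0.3, 0.0]]
values = [[float(i)] for i in range(len(keys))]

original = list(range(len(keys)))
# Oracle importance (assumption): sort by the exact query-key score.
order = sorted(original, key=lambda i: -dot(q, keys[i]))

dense = attention_dense(q, keys, values)
err_orig = abs(attention_permuted(q, keys, values, original, budget=1)[0] - dense[0])
err_perm = abs(attention_permuted(q, keys, values, order, budget=1)[0] - dense[0])
print(err_perm < err_orig)
```

With the full index list and no budget, the permuted loop reproduces the dense result, since softmax attention does not depend on the order in which keys are visited. With a budget of one block, the importance-sorted order lands closer to the dense output than the original order in this toy setup, which is the intuition behind concentrating importance in a few high-priority blocks before stopping.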
