SATA: Sparsity-Aware Scheduling for Selective Token Attention

Zhenkun Fan; Zishen Wan; Che-Kai Liu; Ashwin Sanjay Lele; Win-San Khwa; Bo Zhang; Meng-Fan Chang; Arijit Raychowdhury

arXiv:2601.20267 · cs.AR · January 29, 2026

TL;DR

SATA is a dynamic scheduling scheme for selective token attention in transformers. It improves throughput and energy efficiency by exploiting data locality in the sparse Query-Key access pattern.

Contribution

The paper proposes a locality-centric dynamic scheduling method for sparse attention that reorders operand flow so intermediate Query/Key vectors can be fetched and retired early.

Findings

01 System throughput increased by up to 1.76x

02 Energy efficiency improved by 2.94x

03 Minimal scheduling overhead observed

Abstract

Transformers have become the foundation of numerous state-of-the-art AI models across diverse domains, thanks to their powerful attention mechanism for modeling long-range dependencies. However, the quadratic scaling complexity of attention poses significant challenges for efficient hardware implementation. While techniques such as quantization and pruning help mitigate this issue, selective token attention offers a promising alternative by narrowing the attention scope to only the most relevant tokens, reducing computation and filtering out noise. In this work, we propose SATA, a locality-centric dynamic scheduling scheme that proactively manages sparsely distributed access patterns from selective Query-Key operations. By reordering operand flow and exploiting data locality, our approach enables early fetch and retirement of intermediate Query/Key vectors, improving system…
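The two ideas in the abstract, attending only to selected tokens and reordering work to exploit locality, can be illustrated with a small sketch. It makes two assumptions that the abstract does not confirm: selection is done by top-k score, and queries are ordered with a greedy heuristic that places queries sharing the most selected keys next to each other. The function names `topk_indices`, `selective_attention`, and `locality_order` are hypothetical and are not taken from the paper.

```python
import math


def topk_indices(scores, k):
    """Return the indices of the k largest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def selective_attention(q, keys, values, k):
    """Attend from one query to only its k highest-scoring keys.

    Assumption: top-k selection stands in for whatever selection rule
    a real selective-attention model uses.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
    selected = topk_indices(scores, k)

    # Softmax over the selected scores only (shifted by the max for stability).
    peak = max(scores[i] for i in selected)
    weights = [math.exp(scores[i] - peak) for i in selected]
    total = sum(weights)

    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        for d in range(len(out)):
            out[d] += (w / total) * values[i][d]
    return out, selected


def locality_order(selected_sets):
    """Greedy query ordering (illustrative heuristic, not the paper's scheduler).

    Start with query 0, then repeatedly pick the unscheduled query whose
    selected keys overlap most with those of the previously scheduled query,
    so that already-fetched key vectors are more likely to be reused.
    """
    remaining = list(range(len(selected_sets)))
    order = [remaining.pop(0)]
    while remaining:
        prev = set(selected_sets[order[-1]])
        best = max(remaining, key=lambda j: len(prev & set(selected_sets[j])))
        remaining.remove(best)
        order.append(best)
    return order
```

In this sketch, `selective_attention` returns both the output vector and the selected key indices; collecting those index lists for every query and passing them to `locality_order` gives a processing order in which consecutive queries tend to share keys. A hardware scheduler would work under buffer-capacity and timing constraints that this toy heuristic ignores.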

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Neural Network Applications