Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling

Huizheng Wang; Taiquan Wei; Hongbin Wang; Zichuan Wang; Xinru Tang; Zhiheng Yue; Shaojun Wei; Yang Hu; Shouyi Yin

arXiv:2512.20198·cs.AR·December 25, 2025

TL;DR

This paper introduces STAR, an algorithm-hardware co-design for accelerating sparse attention in large language models. By coordinating the stages of the sparsity acceleration flow and reducing redundant computation and memory access, it reports up to 9.2× speedup over an A100 and a 71.2× energy-efficiency improvement.

Contribution

The paper presents STAR, a cross-stage optimized algorithm and accelerator for sparse attention that enables efficient Transformer inference under large-scale token parallel processing (LTPP).

Findings

01 Up to 9.2× speedup over A100
02 71.2× energy efficiency improvement
03 20.1× throughput increase in spatial architecture

Abstract

Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallel processing (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved…
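The idea of leading-zero-based, add-only prediction in the log domain can be illustrated with a small sketch. This is not the paper's algorithm or hardware; it only shows the general principle that the position of an integer's leading one gives a coarse log2, so a multiply can be replaced by an add of two exponents. The function names, the 8-bit width, and the top-k selection step are all assumptions made for illustration.

```python
# Illustrative sketch only: names, bit width, and the top-k step are
# assumptions, not details taken from the STAR paper.

def leading_zero_count(x: int, width: int = 8) -> int:
    """Count leading zeros of |x| within an unsigned field of `width` bits."""
    mag = abs(x)
    if mag >= (1 << width):
        raise ValueError("magnitude does not fit in the given width")
    return width - mag.bit_length()

def approx_log2(x: int, width: int = 8) -> int:
    """Coarse floor(log2|x|) derived from the leading-zero count (x != 0)."""
    return width - 1 - leading_zero_count(x, width)

def approx_score(q, k, width: int = 8) -> int:
    """Estimate dot(q, k) using only exponent additions per element."""
    total = 0
    for qi, ki in zip(q, k):
        if qi == 0 or ki == 0:
            continue  # a zero operand contributes nothing
        # Adding exponents stands in for multiplying magnitudes.
        exponent = approx_log2(qi, width) + approx_log2(ki, width)
        sign = -1 if (qi < 0) != (ki < 0) else 1
        total += sign * (1 << exponent)
    return total

def predict_top_k(q, keys, k: int, width: int = 8):
    """Return indices of the k keys with the largest estimated scores."""
    scores = [approx_score(q, key, width) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return order[:k]

if __name__ == "__main__":
    query = [5, -3]
    keys = [[6, 2], [1, 1], [-8, 7]]
    print([approx_score(query, key) for key in keys])
    print(predict_top_k(query, keys, k=2))
```

The estimate is deliberately coarse: each product is rounded down to a power of two, which is usually enough to rank candidate keys before the exact attention computation runs on the selected subset.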

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Embedded Systems Design Techniques