PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion

Huizheng Wang; Hongbin Wang; Zichuan Wang; Zhiheng Yue; Yang Wang; Chao Li; Yang Hu; Shouyi Yin

arXiv:2512.14322 · cs.AR · January 13, 2026

TL;DR

PADE introduces a predictor-free sparse attention accelerator that combines novel algorithms and hardware design to significantly improve speed and energy efficiency in attention-based models, eliminating the need for costly sparsity predictors.

Contribution

The paper proposes a unified, predictor-free approach that uses techniques such as BUI-GF, BS-OOE, and ISTA to accelerate sparse attention in hardware.

Findings

01 Achieves 7.43x speedup over Nvidia H100 GPU

02 Delivers 31.1x higher energy efficiency than Nvidia H100

03 Outperforms state-of-the-art accelerators in energy savings

Abstract

Attention-based models have revolutionized AI, but the quadratic cost of self-attention incurs severe computational and memory overhead. Sparse attention methods alleviate this by skipping low-relevance token pairs. However, current approaches lack practicality because the added sparsity predictor is expensive and severely degrades their hardware efficiency. This paper advances the state of the art (SOTA) by proposing a bit-serial enabled stage-fusion (BSF) mechanism, which eliminates the need for a separate predictor. However, BSF faces three key challenges: 1) inaccurate bit-sliced sparsity speculation leads to incorrect pruning; 2) fine-grained, imbalanced bit-level workloads cause hardware under-utilization; 3) the row-wise dependency in the sparsity pruning criteria makes tiling difficult. We propose PADE, a predictor-free algorithm-hardware co-design for dynamic sparse…
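To make the idea of bit-sliced sparsity speculation concrete, the following is a minimal sketch of how a query-key score could be bounded while the key is consumed one bit plane at a time, most significant bit first. It is a generic illustration and not PADE's actual algorithm: the function names `bit_serial_bounds` and `speculate`, the unsigned 8-bit key format, and the fixed pruning threshold are all assumptions made for this example. It shows why guarding a speculative decision with an uncertainty interval can avoid the incorrect pruning named in challenge 1.

```python
def bit_serial_bounds(q, k, bits=8):
    """Yield (lower, upper) bounds on dot(q, k) after each key bit plane, MSB first.

    Assumed formats (hypothetical, for illustration only):
    q is a list of signed ints, k is a list of unsigned ints in [0, 2**bits).
    """
    pos = sum(x for x in q if x > 0)  # query mass that can raise the score
    neg = sum(x for x in q if x < 0)  # query mass that can lower the score
    partial = 0
    for b in range(bits - 1, -1, -1):
        # Add the exact contribution of bit plane b across all key elements.
        partial += sum(qi * ((ki >> b) & 1) for qi, ki in zip(q, k)) << b
        # Each key element's unseen low bits are worth at most 2**b - 1.
        remaining = (1 << b) - 1
        yield partial + neg * remaining, partial + pos * remaining


def speculate(q, k, threshold, bits=8):
    """Return (keep, planes_used), deciding as early as the bounds allow."""
    for used, (lo, hi) in enumerate(bit_serial_bounds(q, k, bits), start=1):
        if hi < threshold:
            return False, used  # score can never reach the threshold: prune
        if lo >= threshold:
            return True, used   # score is guaranteed to reach the threshold
    return True, bits  # not reached: the bounds coincide on the last plane


q = [3, -2, 1]
k = [200, 10, 5]  # exact score is 3*200 - 2*10 + 1*5 = 585
print(speculate(q, k, threshold=1000))  # pruned after the first bit plane
print(speculate(q, k, threshold=100))   # kept after the first bit plane
```

Because the interval always contains the exact score, a key is pruned only when no assignment of the unseen bits could lift it above the threshold. The cost is that borderline keys need more bit planes before a decision is reached.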

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Low-power high-performance VLSI design