PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion
Huizheng Wang, Hongbin Wang, Zichuan Wang, Zhiheng Yue, Yang Wang, Chao Li, Yang Hu, Shouyi Yin

TL;DR
PADE is a predictor-free sparse attention accelerator. It co-designs algorithm and hardware to improve the speed and energy efficiency of attention-based models without the cost of a separate sparsity predictor.
Contribution
A unified, predictor-free algorithm-hardware co-design for sparse attention acceleration, built on three techniques: BUI-GF, BS-OOE, and ISTA.
Findings
7.43x speedup over the Nvidia H100 GPU
31.1x higher energy efficiency than the Nvidia H100 GPU
Outperforms state-of-the-art accelerators in energy savings
Abstract
Attention-based models have revolutionized AI, but the quadratic cost of self-attention incurs severe computational and memory overhead. Sparse attention methods alleviate this by skipping low-relevance token pairs. However, current approaches are impractical because the added sparsity predictor is expensive and severely degrades hardware efficiency. This paper advances the state-of-the-art (SOTA) by proposing a bit-serial enabled stage-fusion (BSF) mechanism, which eliminates the need for a separate predictor. BSF itself faces three key challenges: 1) inaccurate bit-sliced sparsity speculation leads to incorrect pruning; 2) fine-grained and imbalanced bit-level workloads cause hardware under-utilization; 3) the row-wise dependency in the sparsity pruning criteria makes tiling difficult. We propose PADE, a predictor-free algorithm-hardware co-design for dynamic sparse…
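To make the idea of bit-sliced sparsity speculation concrete, here is a minimal sketch of MSB-first pruning on integer attention scores. It is an illustration only and is not PADE's algorithm. The function name `bit_serial_prune`, the unsigned 8-bit key format, and the `margin` threshold are all assumptions introduced for this example. Keys are consumed one bit plane at a time. After each plane, a key is dropped if even its best possible final score cannot come within `margin` of the best score already guaranteed for another key, so no separate predictor pass is needed.

```python
def bit_serial_prune(q, keys, margin, bits=8):
    """Hypothetical MSB-first pruning sketch (not the PADE method).

    q      : list of signed ints (one query row)
    keys   : list of key rows, each a list of unsigned ints < 2**bits
    margin : keep keys whose score may reach (best score - margin)
    Returns (sorted surviving key indices, partial scores).
    For surviving keys the partial score equals the exact dot product.
    """
    pos = sum(x for x in q if x > 0)  # total positive query weight
    neg = sum(x for x in q if x < 0)  # total negative query weight
    alive = set(range(len(keys)))
    partial = [0] * len(keys)

    for b in range(bits - 1, -1, -1):  # most significant bit plane first
        for j in alive:
            # Add this bit plane's contribution: sum of q_i where bit b of k_i is set.
            plane = sum(qi for qi, ki in zip(q, keys[j]) if (ki >> b) & 1)
            partial[j] += plane << b

        rest = (1 << b) - 1  # largest value the unprocessed low bits can hold
        # Lower bound assumes low bits hurt as much as possible, upper bound the opposite.
        best_lower = max(partial[j] + rest * neg for j in alive)
        alive = {j for j in alive
                 if partial[j] + rest * pos >= best_lower - margin}

    return sorted(alive), partial


q = [1, -2, 3]
keys = [[200, 10, 250], [5, 5, 5], [100, 100, 100]]  # exact scores: 930, 10, 200
survivors, scores = bit_serial_prune(q, keys, margin=100)
print(survivors, scores[survivors[0]])
```

Because the bounds in this sketch are conservative, it never prunes a key whose exact score lies within `margin` of the maximum. The abstract's first challenge concerns speculation that is inaccurate, which this toy version avoids only by paying for safe bounds. The row-wise maximum used in the pruning test is also an instance of the row-wise dependency that the abstract says complicates tiling.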
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Low-power high-performance VLSI design
