ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity
Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu

TL;DR
This paper introduces ESACT, an end-to-end sparse accelerator for Transformers that uses local similarity to cut computation and energy consumption while keeping accuracy loss under 1%.
Contribution
The paper proposes SPLS, a local similarity-based sparsity prediction mechanism, and architectural innovations enabling efficient end-to-end sparse acceleration of Transformers.
Findings
Reduces total computation by 52.03% with less than 1% accuracy loss.
Achieves 3.29 TOPS/W energy efficiency, outperforming state-of-the-art accelerators.
Improves attention-level energy efficiency by up to 2.95x.
Abstract
Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model across various domains due to their outstanding performance. However, their high computational cost hinders efficient hardware deployment. Sparsity offers a promising solution, yet most existing accelerators exploit only intra-row sparsity in attention, while few consider inter-row sparsity. Approaches leveraging inter-row sparsity often rely on costly global similarity estimation, which diminishes the acceleration benefits of sparsity, and typically apply sparsity to only one or two transformer components. Through careful analysis of the attention distribution and computation flow, we observe that local similarity allows end-to-end sparse acceleration with lower computational overhead. Motivated by this observation, we propose ESACT, an end-to-end sparse…
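The contrast between global and local similarity estimation can be illustrated with a small sketch. This is not the paper's SPLS mechanism, whose details are not given in this excerpt; it is a minimal illustration that assumes cosine similarity as the metric and a fixed look-back window. The names `cosine`, `local_similarity_map`, `window`, and `threshold` are hypothetical. Each attention row is compared only against rows computed within the previous `window` positions, so the number of comparisons grows roughly with n × window rather than with n² as in an all-pairs search.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def local_similarity_map(rows, window=2, threshold=0.95):
    """Hypothetical local-similarity pass over attention rows.

    Returns (reuse, comparisons). reuse[i] is the index of the row whose
    result row i would reuse; reuse[i] == i means row i must be computed.
    """
    reuse = []
    comparisons = 0
    for i, row in enumerate(rows):
        target = i
        # Look back only `window` rows instead of scanning every earlier row.
        for j in range(max(0, i - window), i):
            if reuse[j] != j:
                continue  # compare only against rows that are actually computed
            comparisons += 1
            if cosine(row, rows[j]) >= threshold:
                target = j
                break
        reuse.append(target)
    return reuse, comparisons


# Toy attention rows: rows 0 and 1 are near-duplicates, as are rows 2 and 3.
rows = [
    [0.70, 0.20, 0.10, 0.00],
    [0.68, 0.22, 0.10, 0.00],
    [0.05, 0.05, 0.10, 0.80],
    [0.06, 0.04, 0.10, 0.80],
]
reuse, comparisons = local_similarity_map(rows, window=2, threshold=0.95)
computed_fraction = len(set(reuse)) / len(rows)
```

On this toy input, rows 1 and 3 reuse the results of rows 0 and 2, so half of the rows need computing, and the pass makes 3 comparisons where an all-pairs search over four rows would make up to 6. The window size and threshold are illustrative values, not settings from the paper.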
Taxonomy
Topics: Advanced Neural Network Applications · Low-power high-performance VLSI design · Parallel Computing and Optimization Techniques
