An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers
Chao Fang, Aojun Zhou, Zhongfeng Wang

TL;DR
This paper introduces an algorithm-hardware co-optimized framework that uses general N:M sparsity patterns to accelerate Transformer models, reporting up to 14.47x speedup over an Intel i9-9900X and a 6.7% accuracy improvement.
Contribution
It proposes a sparsity inheritance mechanism with an inherited dynamic pruning (IDP) method, along with a flexible hardware architecture for N:M sparse Transformer acceleration.
Findings
Achieves 6.7% accuracy improvement with efficient training.
Realizes up to 14.47x speedup over Intel i9-9900X.
Outperforms FPGA-based accelerators by up to 19.47x in inference speed.
Abstract
The Transformer has been an indispensable staple in deep learning. However, for real-life applications, deploying efficient Transformers is very challenging because of the models' immense parameter and operation counts. To relieve this burden, exploiting sparsity is an effective approach to accelerate Transformers. Newly emerging Ampere GPUs leverage a 2:4 sparsity pattern to achieve model acceleration, but this fixed pattern can hardly meet the diverse algorithm and hardware constraints encountered when deploying models. By contrast, we propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns. (1) From the algorithm perspective, we propose a sparsity inheritance mechanism along with an inherited dynamic pruning (IDP) method to obtain a series of N:M sparse candidate Transformers rapidly. A model compression scheme is further proposed…
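To make the N:M pattern mentioned in the abstract concrete: under N:M sparsity, at most N weights in every group of M consecutive weights are nonzero (2:4 is the Ampere case). The sketch below is only an illustration under stated assumptions. It picks survivors by magnitude, which is a common baseline and not necessarily the paper's IDP method, and the function name `nm_prune` is hypothetical.

```python
def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m consecutive
    weights and zero the rest (illustrative magnitude-based selection)."""
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(nm_prune(row, n=2, m=4))
# prints [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Changing `n` and `m` (for example 1:4 or 2:8) yields different sparsity ratios from the same routine, which is the kind of flexibility a general N:M scheme offers over a fixed 2:4 pattern.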
Taxonomy
Methods: Attention Is All You Need · Pruning · Linear Layer · Dense Connections · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Layer Normalization
