An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers

Chao Fang; Aojun Zhou; Zhongfeng Wang

arXiv:2208.06118·cs.AR·November 1, 2022


TL;DR

This paper introduces an algorithm-hardware co-optimized framework that uses general N:M sparsity to accelerate Transformer models, reporting up to a 14.47x speedup over an Intel i9-9900X and a 6.7% accuracy improvement over existing methods.

Contribution

The paper proposes a sparsity inheritance mechanism and an inherited dynamic pruning (IDP) method, together with a flexible hardware architecture for efficient N:M sparse Transformer acceleration.

Findings

01. Achieves 6.7% accuracy improvement with efficient training.

02. Realizes up to 14.47x speedup over Intel i9-9900X.

03. Outperforms FPGA-based accelerators by up to 19.47x in inference speed.

Abstract

The Transformer has become an indispensable staple of deep learning. However, deploying efficient Transformers in real-life applications is very challenging because of the models' immense parameter and operation counts. Exploiting sparsity is an effective way to relieve this burden and accelerate Transformers. Newly emerging Ampere GPUs leverage a 2:4 sparsity pattern to achieve model acceleration, but this fixed pattern can hardly meet the diverse algorithm and hardware constraints that arise when deploying models. By contrast, we propose an algorithm-hardware co-optimized framework that flexibly and efficiently accelerates Transformers by utilizing general N:M sparsity patterns. (1) From the algorithm perspective, we propose a sparsity inheritance mechanism along with an inherited dynamic pruning (IDP) method to rapidly obtain a series of N:M sparse candidate Transformers. A model compression scheme is further proposed…
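To make the N:M pattern in the abstract concrete, the following is a minimal sketch of plain magnitude-based N:M pruning, in which at most N nonzero values are kept in every group of M consecutive weights (2:4 being the pattern the abstract attributes to Ampere GPUs). It is not the paper's IDP method or its sparsity inheritance mechanism; the function name `nm_prune_row` and the example weights are hypothetical and chosen only for illustration.

```python
from typing import List


def nm_prune_row(row: List[float], n: int, m: int) -> List[float]:
    """Keep the n largest-magnitude values in each group of m consecutive
    weights and zero the rest (generic magnitude pruning, not IDP)."""
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")

    pruned: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Rank positions inside the group by magnitude, largest first.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical weights: two groups of four.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
sparse_2_4 = nm_prune_row(row, n=2, m=4)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
sparse_1_4 = nm_prune_row(row, n=1, m=4)  # [0.9, 0.0, 0.0, 0.0, 0.0, 0.0, -0.4, 0.0]
```

Changing `n` and `m` yields different sparsity ratios from the same dense weights, which is the kind of flexibility a general N:M pattern offers over a fixed 2:4 one.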


Taxonomy

Methods: Attention Is All You Need · Pruning · Linear Layer · Dense Connections · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Layer Normalization