A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

Hongwu Peng; Shaoyi Huang; Shiyang Chen; Bingbing Li; Tong Geng; Ang Li; Weiwen Jiang; Wujie Wen; Jinbo Bi; Hang Liu; Caiwen Ding

arXiv:2208.03646·cs.LG·August 23, 2022

TL;DR

This paper presents a length-adaptive FPGA co-design for Transformers that uses sparse attention and dynamic pipelining to significantly improve speed and energy efficiency for NLP tasks.

Contribution

It introduces a hardware-friendly sparse attention operator and a length-aware resource scheduling algorithm for FPGA-based Transformer acceleration.

Findings

01  80.2× faster than CPU

02  2.6× faster than GPU

03  4× higher energy efficiency than GPU

Abstract

Transformers have been considered one of the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite these remarkable successes, the long turnaround time of Transformer models is a widely recognized roadblock. Varying sequence lengths impose additional computing overhead, because inputs must be zero-padded to the maximum sentence length in the batch to suit parallel computing platforms. This paper targets the field-programmable gate array (FPGA) and proposes a coherent sequence-length-adaptive algorithm-hardware co-design for Transformer acceleration. In particular, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator brings the complexity of…
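
The zero-padding overhead mentioned in the abstract can be illustrated with a minimal sketch. The helper name `padding_overhead` and the example batch lengths are hypothetical and do not come from the paper; the sketch only counts how many token slots in a batch hold padding when every sequence is padded to the longest one.

```python
def padding_overhead(lengths):
    """Return the fraction of token slots in a batch that are zero padding.

    Assumes every sequence is padded to the longest length in the batch,
    as the abstract describes for parallel computing platforms.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    max_len = max(lengths)
    padded_slots = max_len * len(lengths)  # slots actually processed
    real_tokens = sum(lengths)             # slots carrying real input
    return (padded_slots - real_tokens) / padded_slots


# Hypothetical batch of sentence lengths (in tokens), for illustration only.
batch_lengths = [12, 35, 128, 47]
print(f"{padding_overhead(batch_lengths):.1%} of token slots are padding")
```

For this made-up batch, more than half of the processed slots are padding, which is the kind of waste a length-adaptive design would aim to avoid.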

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization