Accelerating Attention through Gradient-Based Learned Runtime Pruning

Zheng Li, Soroush Ghodrati, Amir Yazdanbakhsh, Hadi Esmaeilzadeh, Mingu Kang

arXiv:2204.03227 · cs.CL · April 18, 2022

Open Access

TL;DR

This paper introduces a gradient-based method that learns a threshold for pruning low attention scores at runtime in transformer models, accelerating computation while maintaining accuracy.

Contribution

The paper proposes a differentiable regularizer that optimizes attention pruning thresholds during training, and a specialized bit-serial architecture for efficient implementation.

Findings

01 Achieves 1.9x speedup and 3.9x energy reduction on average.

02 Maintains accuracy within 0.2% degradation across multiple models.

03 Demonstrates effectiveness on 43 diverse NLP and vision tasks.

Abstract

Self-attention is a key enabler of state-of-the-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words correlates highly with the word under attention, and that subset is determined only at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the training loss function. This formulation piggybacks on back-propagation training to analytically co-optimize the threshold and the weights…
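
The thresholding idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the sigmoid surrogate, the `temperature` parameter, and the rule that each row always keeps its largest score are all assumptions made for illustration. The abstract is truncated and does not give the exact form of the regularizer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def hard_pruned_attention(q, k, v, threshold):
    """Runtime view: attention scores below `threshold` are dropped before softmax.

    Hypothetical helper, not the authors' code.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    keep = scores >= threshold
    # Assumption of this sketch: always keep each row's largest score,
    # so that no query ends up with an empty set of keys.
    keep |= scores == scores.max(axis=-1, keepdims=True)
    masked = np.where(keep, scores, -np.inf)
    return softmax(masked) @ v, keep

def soft_keep_fraction(scores, threshold, temperature=0.1):
    """Smooth surrogate for the fraction of scores kept.

    A sigmoid replaces the discrete step (score >= threshold), so the value
    varies smoothly with `threshold`. The sigmoid is written with tanh for
    numerical stability. The surrogate's form is an assumption.
    """
    z = (scores - threshold) / temperature
    return np.mean(0.5 * (1.0 + np.tanh(0.5 * z)))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 query vectors
k = rng.normal(size=(6, 8))   # 6 key vectors
v = rng.normal(size=(6, 3))   # 6 value vectors

out, keep = hard_pruned_attention(q, k, v, threshold=0.0)
print("fraction of scores kept:", keep.mean())
print("soft surrogate:", soft_keep_fraction(q @ k.T / np.sqrt(8), 0.0))
```

In a training setup, a term such as the soft keep fraction could be added to the task loss so that gradients reach the threshold along with the weights; an autodiff framework would be needed for that step, which this forward-only sketch omits.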

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Cosine Annealing · Adam · Residual Connection · Byte Pair Encoding · Dense Connections