Accelerating Attention through Gradient-Based Learned Runtime Pruning
Zheng Li, Soroush Ghodrati, Amir Yazdanbakhsh, Hadi Esmaeilzadeh, Mingu Kang

TL;DR
This paper introduces a gradient-based method to dynamically prune low-impact attention scores in transformer models, significantly accelerating computation while maintaining accuracy.
Contribution
It proposes a differentiable regularizer to optimize attention pruning thresholds during training and a specialized bit-serial architecture for efficient implementation.
Findings
Achieves 1.9x speedup and 3.9x energy reduction on average.
Maintains accuracy within 0.2% degradation across multiple models.
Demonstrates effectiveness on 43 diverse NLP and vision tasks.
Abstract
Self-attention is a key enabler of state-of-the-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words highly correlates with the word under attention, and that subset is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggybacks on back-propagation training to analytically co-optimize the threshold and the weights…
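The threshold idea in the abstract can be illustrated with a minimal sketch. It is not the paper's exact formulation: the sigmoid gate, the `temperature` parameter, the mean-gate regularizer, and all function names below are assumptions chosen for illustration. The sketch shows hard pruning of low scores at inference time, and a smooth surrogate for the keep/prune decision whose gradient with respect to the threshold is well defined, so the threshold could be updated by gradient descent alongside the weights.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def hard_prune_attention(scores, threshold):
    """Inference-time pruning: scores below the threshold get zero
    probability; softmax is taken over the surviving scores only."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    probs = [0.0] * len(scores)
    if kept:
        for i, p in zip(kept, softmax([scores[i] for i in kept])):
            probs[i] = p
    return probs

def soft_keep_gate(score, threshold, temperature=0.5):
    """Assumed differentiable stand-in for the step (score >= threshold):
    a sigmoid that approaches the step as temperature goes to 0."""
    return 1.0 / (1.0 + math.exp(-(score - threshold) / temperature))

def sparsity_regularizer(scores, threshold, temperature=0.5):
    """Assumed regularizer: mean soft keep-gate. Adding it to the loss
    rewards pruning more scores, i.e. pushes the threshold upward."""
    gates = [soft_keep_gate(s, threshold, temperature) for s in scores]
    return sum(gates) / len(gates)

def regularizer_grad_wrt_threshold(scores, threshold, temperature=0.5):
    """Analytic d(regularizer)/d(threshold) using d(sigmoid)/dz = g * (1 - g)."""
    total = 0.0
    for s in scores:
        g = soft_keep_gate(s, threshold, temperature)
        total += -g * (1.0 - g) / temperature
    return total / len(scores)

# Hypothetical attention scores for one query against five keys.
scores = [4.0, 0.5, -1.0, 3.5, -2.0]
probs = hard_prune_attention(scores, threshold=1.0)
grad = regularizer_grad_wrt_threshold(scores, threshold=1.0)
```

In a real training loop the regularizer would be weighted against the task loss, since the task loss is what keeps the threshold from rising until everything is pruned; that weighting is likewise an assumption here rather than something stated in the abstract.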
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Cosine Annealing · Adam · Residual Connection · Byte Pair Encoding · Dense Connections
