Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference

Ghadeer Jaradat; Mohammed Tolba; Ghada Alsuhli; Hani Saleh; Mahmoud Al-Qutayri; Thanos Stouraitis; Baker Mohammad

arXiv:2407.12893·cs.LG·July 19, 2024

Open Access

TL;DR

This paper introduces Hybrid Dynamic Pruning (HDP), a novel approach combining pruning and approximation techniques with a specialized co-processor to make Transformer inference more efficient on edge devices.

Contribution

It presents a new co-designed algorithm-architecture approach that prunes attention heads and blocks dynamically at runtime, reducing computation and memory usage.

Findings

1. Reduces attention computation by pruning unimportant heads and blocks.

2. Achieves lower latency and power consumption in Transformer inference.

3. Demonstrates effectiveness on edge devices with improved efficiency.

Abstract

Transformer models have become central to deep learning, driving improvements in applications that range from language understanding to image recognition. Despite their success, deploying these models in real-time applications, particularly on edge devices, poses significant challenges because of their quadratic computational cost and memory demands. To overcome these challenges, we introduce Hybrid Dynamic Pruning (HDP), an efficient algorithm-architecture co-design approach that accelerates transformers by exploiting head sparsity, block sparsity, and approximation opportunities to reduce both attention computation and memory access. Having observed substantial redundancy in attention scores and attention heads, we propose a novel integer-based row-balanced block pruning to prune unimportant blocks in the attention…
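The abstract is cut off before it describes the pruning rule, so the following is only a minimal sketch of what row-balanced block pruning of an attention score matrix could look like, not the authors' algorithm. It assumes that block importance is the sum of absolute integer parts of the scores in a tile, and that "row-balanced" means every block row keeps the same number of blocks. The function names and the `block` and `keep` parameters are hypothetical and chosen for illustration.

```python
def block_importance(scores, block):
    """Sum |int(score)| inside each block x block tile of a square matrix.

    Using only the integer part is an assumed reading of "integer-based".
    """
    n = len(scores)
    if n % block != 0:
        raise ValueError("matrix size must be a multiple of the block size")
    nb = n // block
    importance = [[0] * nb for _ in range(nb)]
    for i in range(n):
        for j in range(n):
            importance[i // block][j // block] += abs(int(scores[i][j]))
    return importance


def row_balanced_mask(importance, keep):
    """Keep the `keep` highest-importance blocks in every block row."""
    mask = []
    for row in importance:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        kept = set(order[:keep])
        mask.append([j in kept for j in range(len(row))])
    return mask


def prune_scores(scores, block, keep):
    """Zero out every score that falls in a pruned block."""
    mask = row_balanced_mask(block_importance(scores, block), keep)
    return [
        [v if mask[i // block][j // block] else 0.0 for j, v in enumerate(row)]
        for i, row in enumerate(scores)
    ]


# Toy 4x4 score matrix split into 2x2 blocks, keeping one block per block row.
scores = [
    [5.2, 4.1, 0.3, 0.2],
    [3.9, 6.0, 0.1, 0.4],
    [0.2, 0.1, 7.5, 1.2],
    [0.3, 0.9, 2.2, 8.1],
]
mask = row_balanced_mask(block_importance(scores, 2), keep=1)
pruned = prune_scores(scores, block=2, keep=1)
```

Because every block row retains the same number of blocks, the work per row is uniform, which is the kind of regularity a hardware co-processor can exploit. A threshold-based criterion would fit the same structure by swapping out `row_balanced_mask`.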

Taxonomy

Topics: Power Transformer Diagnostics and Insulation

Methods: Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax