Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference
Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad

TL;DR
This paper introduces Hybrid Dynamic Pruning (HDP), a novel approach combining pruning and approximation techniques with a specialized co-processor to make Transformer inference more efficient on edge devices.
Contribution
It presents a new co-designed algorithm-architecture approach that prunes attention heads and blocks dynamically at runtime, reducing computation and memory usage.
Findings
Reduces attention computation by pruning unimportant heads and blocks.
Achieves lower latency and power consumption in Transformer inference.
Demonstrates effectiveness on edge devices with improved efficiency.
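As a rough illustration of what pruning heads at runtime could look like, the sketch below scores each attention head and keeps only the heads above a threshold. This summary does not say how HDP measures head importance, so the metric (sum of absolute integer scores), the `threshold` parameter, and the name `select_heads` are all assumptions made for illustration, not the paper's method.

```python
from typing import List

Matrix = List[List[int]]


def select_heads(head_scores: List[Matrix], threshold: int) -> List[int]:
    """Return the indices of heads whose importance reaches `threshold`.

    Importance here is the sum of absolute integer scores in a head's
    score matrix. This metric is an assumption for illustration only.
    """
    importance = [
        sum(abs(value) for row in head for value in row)
        for head in head_scores
    ]
    # Heads below the threshold are skipped entirely, so none of their
    # downstream attention work needs to be computed.
    return [i for i, imp in enumerate(importance) if imp >= threshold]


# Three toy heads with 2x2 integer score matrices.
heads = [
    [[5, 5], [5, 5]],   # importance 20
    [[0, 1], [0, 0]],   # importance 1
    [[3, 3], [3, 3]],   # importance 12
]
kept = select_heads(heads, threshold=10)
print(kept)
```

Because the decision depends on the scores of the current input, the set of kept heads can differ from one input to the next, which is what distinguishes dynamic pruning from pruning fixed at training time.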
Abstract
Transformer models have become central to deep learning, driving improvements across a wide range of applications, from language understanding to image recognition. Despite their success, deploying these models in real-time applications, particularly on edge devices, poses significant challenges because of their quadratic computational cost and memory demands. To overcome these challenges, we introduce Hybrid Dynamic Pruning (HDP), an efficient algorithm-architecture co-design approach that accelerates Transformers by exploiting head sparsity, block sparsity, and approximation opportunities to reduce attention computation and memory access. Observing the large redundancy in attention scores and attention heads, we propose a novel integer-based row-balanced block pruning to prune unimportant blocks in the attention…
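The abstract is cut off before it describes the pruning rule, so the following is only a minimal sketch of the general idea behind row-balanced block pruning: split an integer score matrix into square blocks and keep the same number of blocks in every block-row. The importance metric (sum of absolute values), the tie-breaking rule, and the names `row_balanced_block_mask`, `block`, and `keep` are assumptions, not details taken from the paper.

```python
from typing import List


def row_balanced_block_mask(
    scores: List[List[int]], block: int, keep: int
) -> List[List[bool]]:
    """Return a block-level mask with exactly `keep` blocks kept per block-row.

    `scores` is an integer matrix whose dimensions are multiples of `block`.
    Block importance is the sum of absolute scores (an assumed metric).
    """
    n_rows, n_cols = len(scores), len(scores[0])
    if n_rows % block or n_cols % block:
        raise ValueError("matrix dimensions must be multiples of the block size")
    block_rows, block_cols = n_rows // block, n_cols // block
    if not 0 <= keep <= block_cols:
        raise ValueError("keep must be between 0 and the number of block columns")

    mask = []
    for i in range(block_rows):
        importance = []
        for j in range(block_cols):
            total = 0
            for r in range(i * block, (i + 1) * block):
                for c in range(j * block, (j + 1) * block):
                    total += abs(scores[r][c])
            importance.append(total)
        # Rank blocks in this block-row by importance; ties go to the
        # lower column index so the result is deterministic.
        order = sorted(range(block_cols), key=lambda j: (-importance[j], j))
        kept = set(order[:keep])
        mask.append([j in kept for j in range(block_cols)])
    return mask


scores = [
    [9, 8, 0, 1],
    [7, 9, 1, 0],
    [0, 1, 5, 6],
    [1, 0, 6, 5],
]
mask = row_balanced_block_mask(scores, block=2, keep=1)
print(mask)
```

Keeping an equal number of blocks in every block-row means each row of processing elements receives the same amount of work, which is the usual motivation for a row-balanced sparsity pattern in hardware; integer scores keep the ranking step cheap.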
Taxonomy
Topics: Power Transformer Diagnostics and Insulation
Methods: Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax
