Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
Hongjie Wang, Bhishma Dedhia, Niraj K. Jha

TL;DR
Zero-TPrune is a zero-shot token pruning method for pre-trained Transformers. It uses the model's attention graph to cut inference cost without any fine-tuning, which makes it suitable for edge deployment.
Contribution
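Zero-TPrune is the first zero-shot token pruning method that considers both the importance and the similarity of tokens. It derives token importance from the attention graph with a novel Weighted Page Rank (WPR) algorithm, eliminating the need for fine-tuning.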
Findings
Reduces FLOPs of vision Transformers by 34.7% without fine-tuning.
Improves throughput by 45.3% with only 0.4% accuracy loss.
Outperforms state-of-the-art fine-tuning-free pruning methods in accuracy retention.
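As a rough sanity check on how the FLOPs and throughput figures above relate, the short calculation below assumes, as an idealization not stated in the source, that throughput is inversely proportional to FLOPs. Under that assumption a 34.7% FLOPs reduction bounds the throughput gain at roughly 53%, so the reported 45.3% falls plausibly below the ideal (memory traffic and pruning overhead are not counted in FLOPs).

```python
flops_reduction = 0.347  # reported FLOPs reduction (from the findings above)
reported_gain = 0.453    # reported throughput improvement (from the findings above)

# Idealized assumption: throughput is inversely proportional to FLOPs.
ideal_gain = 1.0 / (1.0 - flops_reduction) - 1.0

print(f"ideal throughput gain:    {ideal_gain:.1%}")
print(f"reported throughput gain: {reported_gain:.1%}")
```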
Abstract
Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exponentially growing inference cost that scales quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to address this challenge due to its ease of deployment on various Transformer backbones. However, most token pruning methods require computationally expensive fine-tuning, which is undesirable in many edge deployment cases. In this work, we propose Zero-TPrune, the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pre-trained Transformer models to produce an importance distribution for tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning for efficient similarity-based pruning. Due to the…
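The idea of ranking tokens by running a PageRank-style iteration on the attention graph can be illustrated with a minimal sketch. This is not the authors' implementation: the update rule (a token is important if important tokens attend to it), the function names `weighted_page_rank` and `keep_top_tokens`, the toy attention matrix, and the simple top-k selection are all assumptions made for illustration. The similarity-based pruning stage mentioned in the abstract is not shown.

```python
def weighted_page_rank(attn, iters=50, tol=1e-9):
    """Importance scores from a row-stochastic attention matrix.

    attn[j][i] is the attention that token j (query) pays to token i (key);
    each row sums to 1, as it would after a softmax.
    """
    n = len(attn)
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iters):
        # Token i collects the attention it receives, weighted by the
        # current importance of each token j that attends to it.
        new = [sum(attn[j][i] * scores[j] for j in range(n)) for i in range(n)]
        total = sum(new)
        new = [v / total for v in new]  # renormalize against numerical drift
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores


def keep_top_tokens(scores, keep):
    """Indices of the `keep` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy 4-token attention matrix (hypothetical values; every row sums to 1).
attn = [
    [0.10, 0.60, 0.20, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.10, 0.60, 0.20, 0.10],
]
scores = weighted_page_rank(attn)
kept = keep_top_tokens(scores, keep=2)
print(scores, kept)
```

In this toy example token 1 receives most of the attention from every other token, so it ends up with the largest score. In a real model the matrix would come from a Transformer layer's attention weights, and how heads and layers are combined is a design choice this sketch leaves open.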
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Memory and Neural Computing
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Vision Transformer
