Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Hongjie Wang; Bhishma Dedhia; Niraj K. Jha

arXiv:2305.17328 · cs.CV · April 9, 2024 · 2 cites

TL;DR

Zero-TPrune is a zero-shot token pruning method for pre-trained Transformers. It uses the attention graph to reduce computational cost without fine-tuning, which makes it suitable for edge deployment.

Contribution

It is the first zero-shot token pruning approach that uses attention graphs and a novel Weighted Page Rank algorithm, eliminating the need for fine-tuning.

Findings

01 Reduces FLOPs of vision Transformers by 34.7% without fine-tuning.

02 Improves throughput by 45.3% with only 0.4% accuracy loss.

03 Outperforms state-of-the-art fine-tuning-free pruning methods in accuracy retention.

Abstract

Deploying Transformer models on edge devices is becoming increasingly challenging because inference cost grows rapidly, scaling quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to this challenge because it is easy to deploy on various Transformer backbones. However, most token pruning methods require computationally expensive fine-tuning, which is undesirable in many edge deployment cases. In this work, we propose Zero-TPrune, the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pre-trained Transformer models to produce an importance distribution for tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning for efficient similarity-based pruning. Due to the…
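To make the idea of an importance distribution derived from the attention graph more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the attention matrix is treated as a weighted directed graph and that importance is propagated PageRank-style: a token is important if important tokens attend to it. The function names `attention_graph_importance` and `prune_tokens`, the iteration count, and the tolerance are all hypothetical, and the exact WPR update rule is not given in this excerpt. The similarity-based pruning stage mentioned in the abstract is not shown.

```python
import numpy as np


def attention_graph_importance(attn, num_iters=50, tol=1e-6):
    """Estimate token importance from one attention matrix.

    attn[i, j] is the attention that token i pays to token j; each row is
    assumed to sum to 1 (softmax output). This is an illustrative
    PageRank-style iteration, not necessarily the paper's exact WPR update.
    """
    n = attn.shape[0]
    scores = np.full(n, 1.0 / n)  # start from a uniform distribution
    for _ in range(num_iters):
        # Token j collects attention from every token i, weighted by how
        # important token i currently is.
        new_scores = attn.T @ scores
        new_scores /= new_scores.sum()  # keep it a probability distribution
        converged = np.abs(new_scores - scores).max() < tol
        scores = new_scores
        if converged:
            break
    return scores


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    k = max(1, int(round(keep_ratio * len(scores))))
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return tokens[keep], keep


# Toy example: three tokens, all of which attend mostly to token 0.
attn = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
])
tokens = np.arange(3 * 4, dtype=float).reshape(3, 4)  # 3 tokens, dim 4
scores = attention_graph_importance(attn)
kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio=2 / 3)
```

In this toy case token 0 receives the most attention and ends up with the highest score, while token 2 receives the least and is the one dropped. Because the scores come directly from attention weights the model already computes, no additional training is involved, which is the property the abstract refers to as zero-shot.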

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Memory and Neural Computing

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Vision Transformer