PPT: Token Pruning and Pooling for Efficient Vision Transformers
Xinjian Wu, Fanhu Zeng, Xiudong Wang, Xinghao Chen

TL;DR
PPT introduces a combined token pruning and pooling framework for Vision Transformers, significantly reducing computational cost while preserving accuracy, thus enhancing efficiency for practical applications.
Contribution
The paper presents a novel, parameter-free framework that adaptively combines token pruning and pooling to reduce redundancy in ViTs, improving efficiency without accuracy loss.
Findings
Reduces over 37% FLOPs on DeiT-S
Increases throughput by over 45%
Maintains accuracy on ImageNet dataset
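As a rough illustration of why removing tokens lowers FLOPs, the sketch below counts multiply-accumulate operations (MACs) for one standard transformer block as a function of the token count. The formula is the usual textbook estimate, not a figure from the paper. The function name `block_macs` is hypothetical, and the DeiT-S values used in the usage lines (197 tokens, embedding dimension 384) are assumptions that do not appear in this summary.

```python
def block_macs(n_tokens, dim, mlp_ratio=4):
    """Approximate MACs of one transformer block (ignores norms, biases, softmax)."""
    # Q, K, V and output projections: four (dim x dim) matmuls per token.
    proj = 4 * n_tokens * dim * dim
    # Attention scores (Q K^T) and the attention-weighted sum of V.
    attn = 2 * n_tokens * n_tokens * dim
    # Two MLP layers with hidden width mlp_ratio * dim.
    mlp = 2 * mlp_ratio * n_tokens * dim * dim
    return proj + attn + mlp


# Assumed DeiT-S-like setting: 197 tokens, dim 384 (not stated in this summary).
full = block_macs(197, 384)
reduced = block_macs(120, 384)  # hypothetical token count after reduction
print(f"relative cost with 120 tokens: {reduced / full:.2f}")
```

Because the attention term grows quadratically in the token count, the cost of a block falls slightly faster than the token count itself.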
Abstract
Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks. However, the high computational complexity poses a significant barrier to their practical applications in real-world scenarios. Motivated by the fact that not all tokens contribute equally to the final predictions and fewer tokens bring less computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating vision transformers. However, we argue that it is not optimal to either only reduce inattentive redundancy by token pruning, or only reduce duplicative redundancy by token merging. To this end, in this paper we propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT), to adaptively tackle these two types of redundancy in different layers. By heuristically integrating both…
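The two kinds of token reduction named in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: pruning is approximated by keeping the tokens with the highest class-attention score, pooling by greedily averaging the most cosine-similar pair, and the function names `prune_tokens` and `pool_tokens` are hypothetical. The adaptive, layer-wise choice between the two is not described in the truncated abstract above and is not shown here.

```python
import numpy as np


def prune_tokens(tokens, cls_attn, keep):
    """Drop inattentive tokens: keep the `keep` rows with the highest score.

    tokens: (N, D) array; cls_attn: (N,) importance score per token (assumed
    here to be the attention the class token pays to each token).
    """
    tokens = np.asarray(tokens, dtype=float)
    idx = np.argsort(cls_attn)[::-1][:keep]
    return tokens[np.sort(idx)]  # sort indices to preserve the original order


def pool_tokens(tokens, n_merge, eps=1e-12):
    """Reduce duplicative tokens: greedily average the most similar pair."""
    tokens = np.asarray(tokens, dtype=float).copy()
    for _ in range(n_merge):
        if len(tokens) < 2:
            break
        norms = np.linalg.norm(tokens, axis=1, keepdims=True)
        normed = tokens / np.maximum(norms, eps)
        sim = normed @ normed.T            # pairwise cosine similarity
        np.fill_diagonal(sim, -np.inf)     # never pair a token with itself
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        tokens[i] = (tokens[i] + tokens[j]) / 2
        tokens = np.delete(tokens, j, axis=0)
    return tokens


x = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]])
scores = np.array([0.4, 0.3, 0.2, 0.1])
pruned = prune_tokens(x, scores, keep=2)   # keeps the two highest-scoring rows
pooled = pool_tokens(x, n_merge=1)         # merges the two near-duplicate rows
```

Pruning discards information from low-scoring tokens outright, while pooling keeps an averaged representative of similar tokens. That difference is why the abstract treats inattentive and duplicative redundancy as separate cases.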
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The proposed method is simple yet effective, without additional trainable parameters, and can be easily incorporated into the standard transformer block. 2. The results demonstrate a better accuracy-compression ratio trade-off than the previous methods.
1. Missing a pretty relevant reference: Unified Vision Transformer Compression (ICLR 2022). Please cite and compare with it on DeiT-S, DeiT-B, and DeiT-Tiny. 2. The backbones used in the paper are DeiT and LV-ViT. How about Swin Transformer, which already has a patch merging module for each stage of blocks? I am curious about how the proposed mechanism generalizes to this kind of structure and how it performs.
The strength of this paper is that it is easy to follow. The figures help a lot in understanding the story. The experiments are good, even though the results are not better than those of some recent papers.
The weaknesses of this paper are as follows: 1. The inference is hard to implement so that it really acts as the authors have claimed. I do not think the throughputs are experimental numbers but theoretical numbers. This is a common issue for adaptive token pruning methods, such as A-ViT. I think in practice this paper is useless, not only not decreasing the real computation cost but increasing it. I do not think the code would be released for inference. 2. The results are not promising
1. How to achieve accuracy-computation balance is a critical problem for ViT. 2. The proposed method seems to be sound. 3. The paper is written well.
1. The novelty is relatively limited. Token pruning and token pooling have been widely investigated in the literature [DynamicViT, Evo-ViT, Self-Slim ViT, etc]. The adaptive design used in the paper is simply the combination of both pruning and pooling without much insightful modification. 2. The results are actually comparable with the state-of-the-art methods, with a little improvement. As shown in Table 1, the proposed method is actually comparable to Evo-ViT, in terms of all the DeiT mod
- The motivation for why redundancy and duplicative redundancy should be handled differently across different layers is clear. - Well-written and easy to follow. - This method can work off-the-shelf without finetuning.
- Lack of discussion and comparison with the highly related work "joint Token Pruning & Squeezing (TPS)" [1], whose method does the pruning and pooling (squeezing/merging) at the same time in a non-adaptive way, and which should be a good and important baseline for this adaptive strategy. - In the small network series such as DeiT-S, the performance is 0.3% behind TPS [1] at a comparable FLOPs budget; I think the performance should be higher to justify the benefit of the adaptive strategy. It wou
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
Methods: Pruning
