A Unified Pruning Framework for Vision Transformers
Hao Yu, Jianxin Wu

TL;DR
This paper introduces UP-ViTs, a unified structural pruning framework for vision transformers and their variants. It prunes all model components while keeping the model structure consistent. Unlike token-reduction methods, it does not break the spatial structure, so the compressed models remain applicable to downstream vision tasks.
Contribution
The paper presents UP-ViTs, a novel unified pruning method for vision transformers that preserves model structure and enhances accuracy across various ViT architectures.
Findings
Achieves 75.79% accuracy on ImageNet with UP-DeiT-T, outperforming vanilla DeiT-T.
Improves PVTv2-B0 accuracy by 4.83% on ImageNet.
Maintains token representation consistency and improves object detection performance.
Abstract
Recently, the vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks. Yet the high computational costs and training data requirements of ViTs limit their application in resource-constrained settings. Model compression is an effective method to speed up deep learning models, but research on compressing ViTs remains less explored. Many previous works concentrate on reducing the number of tokens. However, this line of attack breaks down the spatial structure of ViTs and is hard to generalize to downstream tasks. In this paper, we design a unified framework for structural pruning of both ViTs and their variants, namely UP-ViTs. Our method focuses on pruning all ViT components while maintaining the consistency of the model structure. Abundant experimental results show that our method can achieve high accuracy on compressed ViTs…
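The contrast the abstract draws between token reduction and structural pruning can be illustrated with a minimal NumPy sketch. It is not the paper's method: the helper `prune_mlp_hidden`, the L2-norm importance score, the ReLU activation, and the layer sizes are all assumptions chosen for illustration. The sketch removes hidden channels from a two-layer MLP block and shows that the number of tokens, and therefore the spatial grid, is left untouched.

```python
import numpy as np

def prune_mlp_hidden(w1, b1, w2, keep):
    """Structurally prune the hidden dimension of a two-layer MLP block.

    Shapes: w1 (hidden, d_in), b1 (hidden,), w2 (d_out, hidden).
    Importance here is the L2 norm of each hidden unit's incoming weights.
    This is an illustrative stand-in, not the criterion used in the paper.
    """
    scores = np.linalg.norm(w1, axis=1)
    # Indices of the `keep` highest-scoring hidden units, in original order.
    idx = np.sort(np.argsort(scores)[-keep:])
    return w1[idx], b1[idx], w2[:, idx]

def mlp(x, w1, b1, w2):
    # ReLU is used for simplicity; ViT blocks typically use GELU.
    return np.maximum(x @ w1.T + b1, 0.0) @ w2.T

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 192))   # illustrative: 14x14 patch tokens, width 192
w1 = rng.normal(size=(768, 192))
b1 = np.zeros(768)
w2 = rng.normal(size=(192, 768))

pw1, pb1, pw2 = prune_mlp_hidden(w1, b1, w2, keep=384)
out = mlp(tokens, pw1, pb1, pw2)
print(out.shape)  # (196, 192): same token count and width as the input
```

Because only the hidden channels shrink, the output can still be reshaped to a 14x14 feature map, which is the property that token-reduction methods give up.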
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer
