A Unified Pruning Framework for Vision Transformers

Hao Yu; Jianxin Wu

arXiv:2111.15127 · cs.CV · December 1, 2021 · 5 cites

TL;DR

This paper introduces UP-ViTs, a unified structural pruning framework for vision transformers and their variants. Unlike token-reduction methods, it prunes all ViT components while keeping the model structure consistent, so the compressed models keep their spatial structure and remain usable for downstream vision tasks.

Contribution

The paper presents UP-ViTs, a novel unified pruning method for vision transformers that preserves model structure and enhances accuracy across various ViT architectures.

Findings

01

Achieves 75.79% accuracy on ImageNet with UP-DeiT-T, outperforming vanilla DeiT-T.

02

Improves PVTv2-B0 accuracy by 4.83% on ImageNet.

03

Maintains token representation consistency and improves object detection performance.

Abstract

Recently, the vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks. Yet the high computational costs and training data requirements of ViTs limit their application in resource-constrained settings. Model compression is an effective way to speed up deep learning models, but compressing ViTs has been less explored. Many previous works concentrate on reducing the number of tokens. However, this line of attack breaks the spatial structure of ViTs and is hard to generalize to downstream tasks. In this paper, we design a unified framework for structural pruning of both ViTs and their variants, namely UP-ViTs. Our method focuses on pruning all ViT components while maintaining the consistency of the model structure. Abundant experimental results show that our method can achieve high accuracy on compressed ViTs…
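As a rough illustration of what "structural pruning while maintaining the consistency of the model structure" can mean, the sketch below prunes hidden channels from a toy two-layer MLP, the kind of feed-forward sub-block found in a transformer layer. It is not the UP-ViTs procedure: the L1-norm importance score, the function names, and the toy weights are all assumptions made for this example, and plain Python lists stand in for tensors. The point it shows is that channels are removed while every token is kept and the block's input and output widths stay the same.

```python
# Illustrative sketch only: magnitude-based channel pruning of a toy MLP block.
# The L1-norm importance score is an assumption, not the UP-ViTs criterion.

def matvec(weight, bias, x):
    """Compute weight @ x + bias for a row-major list-of-lists weight."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def mlp_forward(fc1_w, fc1_b, fc2_w, fc2_b, x):
    """Two-layer MLP: fc2(relu(fc1(x)))."""
    return matvec(fc2_w, fc2_b, relu(matvec(fc1_w, fc1_b, x)))

def prune_hidden_channels(fc1_w, fc1_b, fc2_w, fc2_b, keep):
    """Keep the `keep` hidden channels with the largest L1 weight norm.

    Removing hidden channel j deletes row j of fc1 (and its bias) and
    column j of fc2, so the block's input and output widths are unchanged.
    """
    hidden = len(fc1_w)
    scores = [
        sum(abs(w) for w in fc1_w[j]) + sum(abs(row[j]) for row in fc2_w)
        for j in range(hidden)
    ]
    ranked = sorted(range(hidden), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])  # restore original channel order
    new_fc1_w = [fc1_w[j] for j in kept]
    new_fc1_b = [fc1_b[j] for j in kept]
    new_fc2_w = [[row[j] for j in kept] for row in fc2_w]
    return new_fc1_w, new_fc1_b, new_fc2_w, list(fc2_b), kept

# Toy block: 2 input features, 4 hidden channels, 2 output features.
fc1_w = [[1.0, -2.0], [0.01, 0.0], [0.5, 0.5], [0.0, 0.02]]
fc1_b = [0.0, 0.0, 0.1, 0.0]
fc2_w = [[1.0, 0.01, -1.0, 0.0], [0.5, 0.0, 2.0, 0.01]]
fc2_b = [0.0, 0.0]

p1_w, p1_b, p2_w, p2_b, kept = prune_hidden_channels(
    fc1_w, fc1_b, fc2_w, fc2_b, keep=2
)

# Three tokens go in and three tokens come out: no token is dropped.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dense_out = [mlp_forward(fc1_w, fc1_b, fc2_w, fc2_b, t) for t in tokens]
pruned_out = [mlp_forward(p1_w, p1_b, p2_w, p2_b, t) for t in tokens]
```

In this toy case the two low-magnitude hidden channels are removed, the pruned block still maps each 2-dimensional token to a 2-dimensional token, and the outputs stay close to the dense ones. A token-reduction method would instead shorten the `tokens` list, which is the change to spatial structure the abstract argues against.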

Code & Models

Repositories

yuhao318/UP-ViT (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer