SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Zhenglun Kong; Peiyan Dong; Xiaolong Ma; Xin Meng; Mengshu Sun; Wei Niu; Xuan Shen; Geng Yuan; Bin Ren; Minghai Qin; Hao Tang; Yanzhi Wang

arXiv:2112.13890 · cs.CV · September 22, 2022 · 27 cites

Open Access · 1 Repo

TL;DR

SPViT introduces a soft token pruning framework for Vision Transformers that reduces computational costs and latency on edge devices while maintaining high accuracy, enabling real-time performance on mobile platforms.

Contribution

The paper proposes a novel, computation-aware soft pruning method with a dynamic token selector for ViTs, improving efficiency and deployment feasibility on resource-constrained devices.

Findings

01. Significantly reduces ViT computation cost and latency.

02. Maintains accuracy comparable to state-of-the-art models.

03. Enables real-time inference on mobile devices.

Abstract

Recently, Vision Transformer (ViT) has continuously established new milestones in computer vision, but its high computation and memory cost makes it difficult to deploy in industrial production. Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied to various DNN structures. Nevertheless, it remains unclear how to perform pruning tailored specifically to the ViT structure. Considering three key points (the structural characteristics of ViTs, their internal data pattern, and deployment on edge devices), we leverage input token sparsity and propose a computation-aware soft pruning framework that can be set up on vanilla Transformers of both flattened and CNN-type structures, such as Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based multi-head token selector, which is a lightweight module for adaptive…
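As a rough illustration of the idea of soft token pruning, the sketch below keeps the highest-scoring tokens and folds the remaining ones into a single extra token instead of discarding them. It is not the paper's implementation: the function name `soft_prune`, the externally supplied `scores` (standing in for whatever a token selector would produce), the `keep_ratio` parameter, and the score-weighted average used to merge low-scoring tokens are all assumptions made for this example.

```python
from typing import List, Sequence

Token = List[float]


def soft_prune(tokens: Sequence[Token],
               scores: Sequence[float],
               keep_ratio: float) -> List[Token]:
    """Keep top-scoring tokens; merge the rest into one extra token.

    Hypothetical helper for illustration only. `scores` is assumed to be
    a non-negative importance score per token, and the score-weighted
    average used for merging is an assumption of this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return []

    # Number of tokens that survive unchanged (at least one).
    n_keep = max(1, int(keep_ratio * len(tokens)))

    # Rank token indices by score, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(order[:n_keep])   # restore original token order
    drop_idx = order[n_keep:]

    kept = [list(tokens[i]) for i in keep_idx]
    if not drop_idx:
        return kept

    # Merge the low-scoring tokens into a single token rather than
    # discarding them, weighting each by its normalized score.
    total = sum(scores[i] for i in drop_idx)
    if total > 0:
        weights = [scores[i] / total for i in drop_idx]
    else:
        weights = [1.0 / len(drop_idx)] * len(drop_idx)

    dim = len(tokens[0])
    merged = [sum(w * tokens[i][d] for w, i in zip(weights, drop_idx))
              for d in range(dim)]
    return kept + [merged]


# Example: four 2-dimensional tokens, half of them kept as-is.
example_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
example_scores = [0.9, 0.25, 0.8, 0.75]
pruned = soft_prune(example_tokens, example_scores, keep_ratio=0.5)
```

With `keep_ratio=0.5`, the two highest-scoring tokens pass through unchanged and the other two are merged, so later layers would process three tokens instead of four while still seeing a summary of the pruned ones.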

Code & Models

Repositories

peiyanflying/spvit (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Image Processing Techniques and Applications · CCD and CMOS Imaging Sensors

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Byte Pair Encoding