SPViT: Enabling Faster Vision Transformers via Soft Token Pruning
Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Mengshu Sun, Wei Niu, Xuan Shen, Geng Yuan, Bin Ren, Minghai Qin, Hao Tang, Yanzhi Wang

TL;DR
SPViT introduces a soft token pruning framework for Vision Transformers that reduces computational costs and latency on edge devices while maintaining high accuracy, enabling real-time performance on mobile platforms.
Contribution
The paper proposes a novel, computation-aware soft pruning method with a dynamic token selector for ViTs, improving efficiency and deployment feasibility on resource-constrained devices.
Findings
Significantly reduces ViT computation cost and latency.
Maintains comparable accuracy with state-of-the-art models.
Enables real-time inference on mobile devices.
Abstract
Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures. Nevertheless, it stays ambiguous on how to perform exclusive pruning on the ViT structure. Considering three key points: the structural characteristics, the internal data pattern of ViTs, and the related edge device deployment, we leverage the input token sparsity and propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flatten and CNN-type structures, such as Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based multi-head token selector, which is a lightweight module for adaptive…
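To make the idea of soft token pruning concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes per-token importance scores are already available (in the paper they come from a learned, attention-based selector, which is not modeled here), and it assumes that "soft" means low-scoring tokens are merged into a single summary token rather than discarded. The function name soft_prune, the keep_ratio parameter, and the score-weighted averaging are all hypothetical choices made for illustration.

```python
def soft_prune(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens; merge the rest into one summary token.

    tokens: list of equal-length lists of floats, one vector per token.
    scores: list of non-negative floats, one per token (assumed to be given;
            a real selector would predict them from the tokens).
    keep_ratio: hypothetical fraction of tokens to keep unchanged.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return []

    n_keep = max(1, int(len(tokens) * keep_ratio))

    # Rank token indices by descending score (stable for ties).
    order = sorted(range(len(tokens)), key=lambda i: -scores[i])
    keep_idx = sorted(order[:n_keep])  # restore original token order
    drop_idx = order[n_keep:]

    kept = [tokens[i] for i in keep_idx]
    if not drop_idx:
        return kept

    # Score-weighted average of the pruned tokens (uniform if all scores are 0).
    total = sum(scores[i] for i in drop_idx)
    if total > 0:
        weights = [scores[i] / total for i in drop_idx]
    else:
        weights = [1.0 / len(drop_idx)] * len(drop_idx)

    dim = len(tokens[0])
    summary = [
        sum(w * tokens[i][d] for w, i in zip(weights, drop_idx))
        for d in range(dim)
    ]
    return kept + [summary]


# Usage: four 2-dimensional tokens, half of them kept as-is.
example_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
example_scores = [0.75, 0.125, 0.5, 0.375]
pruned = soft_prune(example_tokens, example_scores, keep_ratio=0.5)
```

The sequence shrinks from four tokens to three (two kept plus one summary), so later layers process fewer tokens while some information from the pruned ones is retained. Hard pruning would instead return only the kept tokens.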
Taxonomy
Topics: Advanced Neural Network Applications · Image Processing Techniques and Applications · CCD and CMOS Imaging Sensors
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Byte Pair Encoding
