Training-Free Acceleration of ViTs with Delayed Spatial Merging
Jung Hwan Heo, Seyedarmin Azizi, Arash Fayyazi, Massoud Pedram

TL;DR
This paper introduces DSM (Delayed Spatial Merging), a training-free framework that accelerates Vision Transformers by delaying token merging past the bottom blocks and merging hierarchically, achieving up to 1.8× FLOP reduction and 1.6× throughput speedup with negligible accuracy loss.
Contribution
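The paper improves token merging in two ways, neither of which requires retraining. First, an analysis of attention behavior in ViTs shows a delayed onset of convergent attention, so merging is postponed past the bottom blocks. Second, a hierarchical processing scheme is added to capture multi-scale redundancy between visual tokens.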
Findings
Up to 1.8× FLOP reduction and 1.6× throughput speedup.
Two orders of magnitude faster than existing token merging methods.
Negligible accuracy loss across various ViT models and tasks.
Abstract
Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to…
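The delayed-merging idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simple similarity-based merge in which tokens are split into two alternating sets and the most similar cross-set pairs are averaged, and it leaves out the hierarchical scheme because the abstract does not specify it. The names `merge_tokens`, `delayed_schedule`, and `run_blocks`, and the parameters `delay` and `r`, are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_tokens(tokens, r):
    """Reduce the token count by up to r (assumed merge rule, for illustration).

    Tokens are split into alternating sets A and B. Each A token is matched
    to its most similar B token, and the r best-matched A tokens are averaged
    into their partners. Token order is not preserved.
    """
    if r <= 0 or len(tokens) < 2:
        return [list(t) for t in tokens]
    a, b = tokens[0::2], tokens[1::2]
    matches = []
    for i, t in enumerate(a):
        j = max(range(len(b)), key=lambda k: cosine(t, b[k]))
        matches.append((cosine(t, b[j]), i, j))
    matches.sort(reverse=True)  # most similar pairs first
    merged = {i: j for _, i, j in matches[:min(r, len(a))]}
    sums = [list(t) for t in b]
    counts = [1] * len(b)
    for i, j in merged.items():
        sums[j] = [x + y for x, y in zip(sums[j], a[i])]
        counts[j] += 1
    kept_a = [list(t) for i, t in enumerate(a) if i not in merged]
    new_b = [[x / c for x in s] for s, c in zip(sums, counts)]
    return kept_a + new_b


def delayed_schedule(num_blocks, delay, r):
    """Per-block merge counts: no merging in the first `delay` blocks."""
    return [0 if block < delay else r for block in range(num_blocks)]


def run_blocks(tokens, schedule, block_fn=lambda x: x):
    """Apply a placeholder block, then merge; record tokens left per block."""
    counts = []
    for r in schedule:
        tokens = block_fn(tokens)  # stands in for attention + MLP
        tokens = merge_tokens(tokens, r)
        counts.append(len(tokens))
    return tokens, counts


if __name__ == "__main__":
    toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0],
           [1.0, 1.0], [1.0, 0.9], [-1.0, 0.0], [-1.0, 0.1]]
    _, per_block = run_blocks(toy, delayed_schedule(4, delay=2, r=2))
    print(per_block)
```

With the toy input, the first two blocks leave all 8 tokens untouched and merging only begins at the third block. In a real ViT the `block_fn` placeholder would be a transformer block, and the choice of `delay` would follow from where attention starts to converge.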
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Image Enhancement Techniques · Advanced Vision and Imaging
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Feedforward Network · Attention Dropout · Layer Normalization · Residual Connection · Data-efficient Image Transformer · Byte Pair Encoding
