Training-Free Acceleration of ViTs with Delayed Spatial Merging

Jung Hwan Heo; Seyedarmin Azizi; Arash Fayyazi; Massoud Pedram

arXiv:2303.02331 · cs.CV · July 2, 2024 · 1 citation

TL;DR

This paper introduces DSM (Delayed Spatial Merging), a training-free framework that accelerates Vision Transformers by postponing token merging past the bottom blocks and merging hierarchically, achieving up to 1.8× FLOP reduction and 1.6× throughput speedup with negligible accuracy loss.

Contribution

It proposes a novel delayed spatial merging technique that improves token merging by analyzing attention behavior and incorporating hierarchical processing, without retraining.

Findings

1. Up to 1.8× FLOP reduction and 1.6× throughput speedup.

2. Two orders of magnitude faster than existing token merging methods.

3. Negligible accuracy loss across various ViT models and tasks.

Abstract

Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to…
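
The idea of delaying merging until after the bottom blocks can be illustrated with a small sketch. This is not the authors' implementation: the greedy pairwise averaging below stands in for whatever matching scheme DSM actually uses, the hierarchical (multi-scale) processing is omitted, and the names `delay`, `r`, `merge_once`, and `delayed_merge` are hypothetical. It only shows how a delayed schedule leaves the token count untouched in early blocks and shrinks it afterwards.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_once(tokens):
    """Average the single most similar pair of tokens (greedy, O(n^2)).

    Assumption: a simple stand-in for a real matching algorithm.
    """
    if len(tokens) < 2:
        return list(tokens)
    best = None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if best is None or s > best[0]:
                best = (s, i, j)
    _, i, j = best
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    return [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]


def delayed_merge(tokens, num_blocks, delay, r):
    """Apply up to r merges after each block, starting at block index `delay`.

    The attention/MLP computation of each block is omitted; only the token
    bookkeeping is modeled. Returns the final tokens and the token count
    after each block.
    """
    counts = []
    for block in range(num_blocks):
        if block >= delay:  # bottom blocks (block < delay) are left unmerged
            for _ in range(r):
                if len(tokens) > 1:
                    tokens = merge_once(tokens)
        counts.append(len(tokens))
    return tokens, counts


if __name__ == "__main__":
    toks = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0],
            [0.0, 1.1], [-1.0, 0.0], [-1.0, 0.1]]
    _, counts = delayed_merge(toks, num_blocks=4, delay=2, r=1)
    print(counts)  # token count after each of the 4 blocks
```

With `delay=0` the same function reduces to merging in every block, which makes it easy to compare the two schedules on the same input.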

Code & Models

Repositories

johnheo/fast-compress-vit (PyTorch, official)

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Image Enhancement Techniques · Advanced Vision and Imaging

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Feedforward Network · Attention Dropout · Layer Normalization · Residual Connection · Data-efficient Image Transformer · Byte Pair Encoding