AiluRus: A Scalable ViT Framework for Dense Prediction

Jin Li; Yaoming Wang; Xiaopeng Zhang; Bowen Shi; Dongsheng Jiang; Chenglin Li; Wenrui Dai; Hongkai Xiong; Qi Tian

arXiv:2311.01197 · cs.CV · November 3, 2023 · 2 cites

TL;DR

AiluRus introduces an adaptive resolution approach for vision transformers that selectively merges tokens based on importance, significantly accelerating dense prediction tasks while maintaining performance.

Contribution

The paper proposes an adaptive-resolution method that applies spatial-aware density-based clustering at an intermediate ViT layer to reduce the number of tokens, accelerating dense prediction tasks with little to no performance loss.

Findings

01. 48% FPS acceleration without fine-tuning

02. 52% training time reduction

03. 2.46x FPS speedup with minimal performance drop

Abstract

Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, when it comes to handling long token sequences, especially in dense prediction tasks that require high-resolution input, the complexity of ViTs increases significantly. Notably, dense prediction tasks, such as semantic segmentation or object detection, place more emphasis on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution for different regions in the image according to their importance. Specifically, at the intermediate layer of the ViT, we utilize a spatial-aware density-based clustering algorithm to select representative tokens from the token sequence. Once the representative tokens are determined, we proceed to merge other tokens into their closest…
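The select-then-merge step described in the abstract can be illustrated with a small sketch. The abstract is truncated here and gives no formulas, so everything below is an assumption made for illustration: the density-peaks-style scoring, the way spatial distance is added to feature distance, the averaging used for merging, and all names (`merge_tokens`, `spatial_weight`, `k`, `num_reps`). It is written in plain Python for readability and is not the authors' implementation.

```python
import math


def merge_tokens(tokens, grid_w, num_reps, k=3, spatial_weight=0.5):
    """Pick representative tokens and average the others into them.

    tokens: list of feature vectors (lists of floats) in row-major grid order.
    Assumes len(tokens) >= 2. All formulas here are illustrative assumptions,
    not taken from the paper.
    """
    n = len(tokens)
    pos = [(i // grid_w, i % grid_w) for i in range(n)]

    def dist(i, j):
        # Assumed "spatial-aware" distance: feature distance plus a
        # weighted distance between the two grid positions.
        return (math.dist(tokens[i], tokens[j])
                + spatial_weight * math.dist(pos[i], pos[j]))

    d = [[dist(i, j) for j in range(n)] for i in range(n)]

    # Local density: high when the k nearest neighbours are close.
    rho = []
    for i in range(n):
        nearest = sorted(d[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(x * x for x in nearest) / len(nearest)))

    # Separation: distance to the closest token with strictly higher density
    # (the densest tokens get their largest distance instead).
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(d[i]))

    # Representatives: dense tokens that lie far from any denser token.
    order = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    reps = sorted(order[:num_reps])

    # Assign each token to its closest representative.
    assign = [min(range(len(reps)), key=lambda c: d[i][reps[c]])
              for i in range(n)]
    for c, r in enumerate(reps):
        assign[r] = c  # a representative always keeps its own cluster

    # Merge by averaging the features of each cluster's members.
    dim = len(tokens[0])
    merged = []
    for c in range(len(reps)):
        members = [tokens[i] for i in range(n) if assign[i] == c]
        merged.append([sum(m[f] for m in members) / len(members)
                       for f in range(dim)])
    return merged, assign


# Toy 4x4 token grid: left half has feature 0.0, right half has feature 5.0.
tokens = [[0.0] if col < 2 else [5.0] for row in range(4) for col in range(4)]
merged, assign = merge_tokens(tokens, grid_w=4, num_reps=2)
```

Here 16 tokens are reduced to 2 merged tokens, and `assign` records which merged token each original grid position maps to. A mapping of this kind is what would let a dense prediction head restore a full-resolution output afterwards, though how the paper does this is not stated in the truncated abstract.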

Code & Models

Repositories

caddyless/ailurus
none · Official

Videos

AiluRus: A Scalable ViT Framework for Dense Prediction · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors