AiluRus: A Scalable ViT Framework for Dense Prediction
Jin Li, Yaoming Wang, Xiaopeng Zhang, Bowen Shi, Dongsheng Jiang, Chenglin Li, Wenrui Dai, Hongkai Xiong, Qi Tian

TL;DR
AiluRus introduces an adaptive resolution approach for vision transformers that selectively merges tokens based on importance, significantly accelerating dense prediction tasks while maintaining performance.
Contribution
The paper proposes an adaptive resolution method that uses spatial-aware density-based clustering at an intermediate ViT layer to reduce the token count and accelerate dense prediction tasks with minimal performance loss.
Findings
48% FPS acceleration without fine-tuning
52% training time reduction
2.46x FPS speedup with minimal performance drop
Abstract
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, when it comes to handling long token sequences, especially in dense prediction tasks that require high-resolution input, the complexity of ViTs increases significantly. Notably, dense prediction tasks, such as semantic segmentation or object detection, place more emphasis on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution for different regions in the image according to their importance. Specifically, at the intermediate layer of the ViT, we utilize a spatial-aware density-based clustering algorithm to select representative tokens from the token sequence. Once the representative tokens are determined, we proceed to merge other tokens into their closest…
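The clustering-and-merging step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the density-peaks style scoring, the additive feature-plus-spatial distance, and the names `merge_tokens`, `num_keep`, `k`, and `spatial_weight` are all assumptions made for illustration, since the abstract excerpt does not specify these details.

```python
import numpy as np

def merge_tokens(tokens, coords, num_keep, k=4, spatial_weight=1.0):
    """Hypothetical sketch: pick representative tokens, merge the rest into them.

    tokens: (N, D) token features; coords: (N, 2) patch positions on the grid.
    """
    # Assumed "spatial-aware" distance: feature distance plus weighted grid distance.
    feat = np.linalg.norm(tokens[:, None] - tokens[None], axis=-1)
    spat = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    dist = feat + spatial_weight * spat

    # Local density from the k nearest neighbours (column 0 is the token itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = np.exp(-(knn ** 2).mean(axis=1))

    # Distance to the nearest token with strictly higher density.
    higher = rho[None, :] > rho[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()  # densest token(s) have no higher neighbour

    # Representatives: dense tokens that are far from any denser token.
    reps = np.argsort(-(rho * delta))[:num_keep]

    # Assign every token to its closest representative, then average each cluster.
    assign = dist[:, reps].argmin(axis=1)
    assign[reps] = np.arange(num_keep)  # a representative always owns its cluster
    merged = np.stack([tokens[assign == c].mean(axis=0) for c in range(num_keep)])
    return merged, assign, reps

# Usage on a toy 4x4 patch grid with 8-dimensional features.
rng = np.random.default_rng(0)
grid = np.stack(np.meshgrid(np.arange(4), np.arange(4), indexing="ij"), axis=-1)
coords = grid.reshape(-1, 2).astype(float)
tokens = rng.normal(size=(16, 8))
merged, assign, reps = merge_tokens(tokens, coords, num_keep=4)
print(merged.shape)  # 16 tokens reduced to 4 merged tokens of dimension 8
```

The later layers would then operate on the shorter `merged` sequence, and `assign` records which original position each merged token covers, so a dense output could be recovered by indexing `merged[assign]`. How the paper itself restores full resolution is not stated in the excerpt above.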
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
