Dynamic Token-Pass Transformers for Semantic Segmentation
Yuang Liu, Qiang Zhou, Jing Wang, Fan Wang, Jun Wang, Wei Zhang

TL;DR
This paper introduces DoViT, a dynamic token-pass vision transformer that adaptively reduces inference cost in semantic segmentation by stopping easy tokens early, cutting FLOPs by roughly 40-60% with an mIoU drop of less than 0.8%.
Contribution
The paper proposes a novel dynamic token-pass mechanism in vision transformers for semantic segmentation, enabling adaptive inference cost reduction based on token complexity.
Findings
Reduces FLOPs by 40-60% with an mIoU drop of less than 0.8%
Speeds up ViT-L/B by over 2x on Cityscapes
Effective token separation and reconstruction for accurate segmentation
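To give a feel for why stopping tokens saves compute, here is a rough back-of-the-envelope sketch. It assumes the common approximation that one self-attention layer costs about 4·N·d² multiply-accumulates for the Q, K, V and output projections plus 2·N²·d for the attention scores and the weighted sum, where N is the token count and d the embedding width. The function names, the default sizes, and the choice to ignore the MLP block are illustrative assumptions, not figures from the paper.

```python
def attention_macs(num_tokens, dim):
    """Approximate multiply-accumulates for one self-attention layer."""
    # Q, K, V and output projections: four (N x d) by (d x d) products.
    projections = 4 * num_tokens * dim * dim
    # Q K^T scores plus the attention-weighted sum over V.
    attention = 2 * num_tokens * num_tokens * dim
    return projections + attention


def relative_cost(kept_fraction, num_tokens=1024, dim=768):
    """Cost of attention on the kept tokens relative to the full sequence.

    num_tokens and dim are hypothetical defaults chosen for illustration.
    """
    kept = int(num_tokens * kept_fraction)
    return attention_macs(kept, dim) / attention_macs(num_tokens, dim)


if __name__ == "__main__":
    for fraction in (1.0, 0.75, 0.5, 0.25):
        print(f"keep {fraction:.0%} of tokens: "
              f"{relative_cost(fraction):.2f}x attention cost")
```

Because the score term grows with N², the attention cost falls faster than the fraction of tokens kept. The saving over a whole network depends on how many tokens are stopped at each layer, which this sketch does not model.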
Abstract
Vision transformers (ViT) usually extract features by forwarding all tokens through every self-attention layer, from the first to the last. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which adaptively reduce the inference cost for images of different complexity. DoViT gradually stops a subset of easy tokens from taking part in self-attention and keeps forwarding the hard tokens until they meet the stopping criteria. We employ lightweight auxiliary heads to make the token-pass decision and divide the tokens into keeping and stopping parts. By computing the separated tokens independently, the self-attention layers are sped up with sparse tokens while remaining hardware-friendly. A token reconstruction module collects the grouped tokens and resets them to their original positions in the sequence, which is necessary to predict correct semantic masks. We…
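The separation and reconstruction steps described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The confidence-threshold rule, the function names (split_tokens, toy_layer, reconstruct), and the scalar "tokens" are all hypothetical stand-ins. The abstract only says that auxiliary heads make the decision according to some stopping criteria.

```python
def split_tokens(confidences, threshold):
    """Partition token indices into keep (hard) and stop (easy) groups.

    Assumption: a token counts as easy once its auxiliary-head confidence
    reaches the threshold. The real stopping criterion may differ.
    """
    keep_idx = [i for i, c in enumerate(confidences) if c < threshold]
    stop_idx = [i for i, c in enumerate(confidences) if c >= threshold]
    return keep_idx, stop_idx


def toy_layer(tokens):
    """Hypothetical stand-in for a transformer layer on the kept tokens."""
    return [t + 1.0 for t in tokens]


def reconstruct(num_tokens, keep_idx, kept_tokens, stop_idx, stopped_tokens):
    """Put both token groups back at their original sequence positions."""
    sequence = [None] * num_tokens
    for i, token in zip(keep_idx, kept_tokens):
        sequence[i] = token
    for i, token in zip(stop_idx, stopped_tokens):
        sequence[i] = token
    return sequence


# Four scalar "tokens" with made-up auxiliary-head confidences.
tokens = [0.0, 10.0, 20.0, 30.0]
confidences = [0.99, 0.40, 0.97, 0.10]

keep_idx, stop_idx = split_tokens(confidences, threshold=0.95)
kept = toy_layer([tokens[i] for i in keep_idx])   # only hard tokens are processed
stopped = [tokens[i] for i in stop_idx]           # easy tokens are frozen as-is
restored = reconstruct(len(tokens), keep_idx, kept, stop_idx, stopped)
```

Here only tokens 1 and 3 pass through the layer, and reconstruct restores the full-length sequence so that a segmentation head would still see one feature per spatial position. A tensor-based version would typically use gather and scatter operations for the same bookkeeping.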
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
