Dynamic Token-Pass Transformers for Semantic Segmentation

Yuang Liu; Qiang Zhou; Jing Wang; Fan Wang; Jun Wang; Wei Zhang

arXiv:2308.01944 · cs.CV · August 25, 2023 · 1 citation

TL;DR

This paper introduces DoViT, a dynamic token-pass vision transformer that adaptively reduces inference cost in semantic segmentation by stopping easy tokens early, cutting FLOPs by 40-60% with an mIoU drop of less than 0.8%.

Contribution

The paper proposes a novel dynamic token-pass mechanism in vision transformers for semantic segmentation, enabling adaptive inference cost reduction based on token complexity.

Findings

01 Reduces FLOPs by 40-60% with an mIoU drop of less than 0.8%

02 Speeds up ViT-L/B by more than 2x on Cityscapes

03 Effective token separation and reconstruction for accurate segmentation
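To see why stopping tokens early lowers compute, a rough cost model helps. The sketch below uses the common multiply-accumulate estimate for a standard transformer block (4·N·d² for the attention projections, 2·N²·d for the attention itself, and 8·N·d² for an MLP with expansion ratio 4). The token counts per layer are a hypothetical schedule chosen for illustration, not numbers from the paper, and the function names are made up for this example.

```python
def block_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates for one standard transformer block."""
    projections = 4 * num_tokens * dim * dim        # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim   # QK^T scores and weighted sum of V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim    # two linear layers of the MLP
    return projections + attention + mlp


def encoder_macs(tokens_per_layer, dim: int) -> int:
    """Sum the block cost over layers that may each see a different token count."""
    return sum(block_macs(n, dim) for n in tokens_per_layer)


# Hypothetical 12-layer encoder with embedding width 768 (illustrative only).
dim = 768
full = [1024] * 12                             # every layer processes all tokens
pruned = [1024] * 4 + [768] * 4 + [384] * 4    # assumed schedule: tokens stop over depth

saving = 1 - encoder_macs(pruned, dim) / encoder_macs(full, dim)
print(f"estimated encoder compute saved: {saving:.1%}")
```

The estimate covers only the encoder blocks and ignores the auxiliary heads and the decoder, so it should be read as an upper bound on the saving for a given schedule.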

Abstract

Vision transformers (ViT) usually extract features by forwarding all tokens through every self-attention layer, from the first to the last. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which adaptively reduce the inference cost for images of different complexity. DoViT gradually stops a portion of the easy tokens from the self-attention calculation and keeps forwarding the hard tokens until they meet the stopping criteria. We employ lightweight auxiliary heads to make the token-pass decision and divide the tokens into kept and stopped parts. By computing the separated tokens independently, the self-attention layers are sped up with sparse tokens while remaining hardware-friendly. A token reconstruction module collects the grouped tokens and restores them to their original positions in the sequence, which is necessary to predict correct semantic masks. We…
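The separate-then-reconstruct idea in the abstract can be illustrated with a minimal sketch. It assumes that an auxiliary head yields one confidence score per token and that tokens at or above a threshold count as easy and are stopped; the abstract does not specify the actual criterion, so the threshold rule, the function names, and the toy layer are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

Token = List[float]


def split_tokens(confidences: List[float], threshold: float) -> Tuple[List[int], List[int]]:
    """Return (keep_idx, stop_idx): low-confidence tokens keep forwarding."""
    keep_idx = [i for i, c in enumerate(confidences) if c < threshold]
    stop_idx = [i for i, c in enumerate(confidences) if c >= threshold]
    return keep_idx, stop_idx


def token_pass_layer(
    tokens: List[Token],
    confidences: List[float],
    threshold: float,
    layer: Callable[[List[Token]], List[Token]],
) -> Tuple[List[Token], List[int]]:
    """Run `layer` on the kept tokens only, then restore the original order."""
    keep_idx, stop_idx = split_tokens(confidences, threshold)
    kept = [tokens[i] for i in keep_idx]       # dense, shorter sequence
    updated = layer(kept) if kept else []

    # Reconstruction: put every token back at its original position.
    out: List[Token] = [[] for _ in tokens]
    for i in stop_idx:
        out[i] = tokens[i]                     # stopped tokens keep their features
    for i, tok in zip(keep_idx, updated):
        out[i] = tok                           # kept tokens take the layer output
    return out, keep_idx


def toy_layer(batch: List[Token]) -> List[Token]:
    """Stand-in for a self-attention block: adds 1 to every feature."""
    return [[x + 1.0 for x in tok] for tok in batch]


tokens = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
confidences = [0.99, 0.40, 0.97, 0.10]         # assumed auxiliary-head scores
out, keep_idx = token_pass_layer(tokens, confidences, 0.95, toy_layer)
print(keep_idx)  # [1, 3]
print(out)       # [[0.0, 0.0], [2.0, 2.0], [2.0, 2.0], [4.0, 4.0]]
```

Because the kept tokens are gathered into a dense, shorter sequence before the layer runs, the layer itself needs no sparse kernels, which is one plausible reading of the abstract's claim about hardware friendliness.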

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings