Multi-Tailed Vision Transformer for Efficient Inference
Yunke Wang, Bo Du, Wenyuan Wang, Chang Xu

TL;DR
This paper introduces MT-ViT, a multi-tailed vision transformer that adaptively selects the optimal sequence length for efficient inference, significantly reducing FLOPs without accuracy loss.
Contribution
The paper proposes a novel multi-tailed architecture with a tail predictor, enabling dynamic token sequence length selection for improved efficiency in Vision Transformers.
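As a rough illustration of dynamic sequence-length selection, the sketch below picks one of several tails from predictor scores and reports the token count that tail would produce. It is a minimal sketch under stated assumptions, not the authors' implementation: the patch sizes, the 224-pixel image size, the logit values, and the names `TAIL_PATCH_SIZES`, `tokens_for_tail`, and `select_tail` are all hypothetical, and the predictor itself (which would be a small learned network) is replaced by hand-written scores.

```python
from typing import Sequence

# Hypothetical tails: each tail tokenizes a square image with its own patch size.
# These values are assumptions for illustration, not taken from the paper.
TAIL_PATCH_SIZES = (32, 16)


def tokens_for_tail(tail_index: int, image_size: int = 224) -> int:
    """Number of visual tokens the chosen tail produces for a square image."""
    patch = TAIL_PATCH_SIZES[tail_index]
    return (image_size // patch) ** 2


def select_tail(predictor_logits: Sequence[float]) -> int:
    """Pick the tail with the highest predictor score (argmax at inference time)."""
    return max(range(len(predictor_logits)), key=lambda i: predictor_logits[i])


# Made-up predictor outputs for two images.
easy_image_logits = [2.1, -0.3]   # predictor favors the short sequence
hard_image_logits = [-1.0, 0.8]   # predictor favors the long sequence

print(tokens_for_tail(select_tail(easy_image_logits)))  # 49
print(tokens_for_tail(select_tail(hard_image_logits)))  # 196
```

In this toy setup an image scored as easy is encoded with 49 tokens and a harder one with 196, so the downstream encoder does less work on the easy image.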
Findings
Reduces FLOPs significantly without accuracy loss.
Outperforms existing methods in efficiency and accuracy.
Demonstrates effectiveness on ImageNet-1K.
Abstract
Recently, the Vision Transformer (ViT) has achieved promising performance in image recognition and increasingly serves as a powerful backbone in various vision tasks. To satisfy the sequential input of the Transformer, the tail of ViT first splits each image into a sequence of visual tokens with a fixed length. The subsequent self-attention layers then construct global relationships between tokens to produce useful representations for downstream tasks. Empirically, representing the image with more tokens leads to better performance, yet the computational complexity of self-attention, which is quadratic in the number of tokens, can seriously hurt the efficiency of ViT's inference. To reduce computation, a few pruning methods progressively prune uninformative tokens in the Transformer encoder, while leaving the number of tokens before the Transformer untouched. In fact, fewer tokens as…
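The quadratic dependence on token count mentioned in the abstract can be made concrete with a back-of-the-envelope cost model. The sketch below uses a common approximation for one self-attention layer (multiply-accumulates for the four linear projections plus the two token-by-token matrix products). The embedding width of 384 and the token counts 196 and 49 are assumed values for illustration, and the function names are hypothetical; none of these figures are results from the paper.

```python
def projection_macs(n_tokens: int, dim: int) -> int:
    """Q, K, V, and output projections: linear in the number of tokens."""
    return 4 * n_tokens * dim * dim


def attention_macs(n_tokens: int, dim: int) -> int:
    """QK^T and attention-weighted sum of V: quadratic in the number of tokens."""
    return 2 * n_tokens * n_tokens * dim


def layer_macs(n_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates for one self-attention layer."""
    return projection_macs(n_tokens, dim) + attention_macs(n_tokens, dim)


DIM = 384  # assumed embedding width
for n in (196, 49):  # assumed long and short sequence lengths
    print(n, layer_macs(n, DIM))

# Cutting the sequence to a quarter shrinks the quadratic term by 16x,
# while the linear projection term shrinks only by 4x.
print(attention_macs(196, DIM) // attention_macs(49, DIM))    # 16
print(projection_macs(196, DIM) // projection_macs(49, DIM))  # 4
```

Under this approximation, shortening the sequence before the encoder reduces the cost of every layer, which is why choosing the token count at the tail can matter more than pruning tokens partway through.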
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Cell Image Analysis Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Label Smoothing
