Multi-Tailed Vision Transformer for Efficient Inference

Yunke Wang; Bo Du; Wenyuan Wang; Chang Xu

arXiv:2203.01587·cs.CV·March 19, 2024

Open Access

TL;DR

This paper introduces MT-ViT, a multi-tailed vision transformer that adaptively selects the optimal sequence length for efficient inference, significantly reducing FLOPs without accuracy loss.

Contribution

The paper proposes a novel multi-tailed architecture with a tail predictor, enabling dynamic token sequence length selection for improved efficiency in Vision Transformers.

Findings

01. Reduces FLOPs significantly without accuracy loss.

02. Outperforms existing methods in efficiency and accuracy.

03. Demonstrates effectiveness on ImageNet-1K.

Abstract

Recently, the Vision Transformer (ViT) has achieved promising performance in image recognition and increasingly serves as a powerful backbone in various vision tasks. To satisfy the sequential input of the Transformer, the tail of ViT first splits each image into a sequence of visual tokens with a fixed length. The following self-attention layers then construct the global relationships between tokens to produce useful representations for downstream tasks. Empirically, representing the image with more tokens leads to better performance, yet the computational complexity of the self-attention layer, which is quadratic in the number of tokens, can seriously reduce the efficiency of ViT's inference. To reduce computation, a few pruning methods progressively prune uninformative tokens in the Transformer encoder while leaving the number of tokens before the Transformer untouched. In fact, fewer tokens as…
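To make the quadratic cost mentioned in the abstract concrete, the following minimal sketch counts patch tokens and pairwise attention interactions for two patch sizes. The 224-pixel image size and the 16- and 32-pixel patch sizes are assumptions chosen for illustration (they are common ViT settings, not values taken from this page), the function names are hypothetical, and the class token is ignored.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into
    non-overlapping square patches (class token not counted)."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    side = image_size // patch_size
    return side * side


def attention_pair_count(n_tokens: int) -> int:
    """Entries in one self-attention map: every token attends to
    every token, so the count grows as n_tokens squared."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    # Assumed settings for illustration only: 224x224 input, patches of 16 or 32 pixels.
    for patch in (16, 32):
        n = num_tokens(224, patch)
        print(f"patch={patch}: tokens={n}, attention pairs={attention_pair_count(n)}")
```

Under these assumptions, doubling the patch size cuts the token count by a factor of 4 (196 to 49) and the attention map size by a factor of 16 (38416 to 2401), which is why choosing a shorter token sequence before the encoder can save substantial computation.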

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Cell Image Analysis Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Label Smoothing