Exploring and Improving Mobile Level Vision Transformers

Pengguang Chen; Yixin Chen; Shu Liu; Mingchang Yang; Jiaya Jia

arXiv:2108.13015 · cs.CV · August 31, 2021

TL;DR

This paper investigates why vision transformers suffer a dramatic performance drop at the mobile level, introduces an irregular patch embedding module and an adaptive patch fusion module to address it, and reports state-of-the-art results among mobile-level vision transformers.

Contribution

The paper proposes irregular patch embedding and adaptive patch fusion modules for mobile-level vision transformers, improving the DeiT baseline by more than 9% in mobile settings.

Findings

01 Improved DeiT baseline by over 9% in mobile settings

02 Surpassed Swin and CoaT architectures in mobile vision tasks

03 Achieved state-of-the-art results with the proposed modules

Abstract

We study the vision transformer structure at the mobile level in this paper, and find a dramatic performance drop. We analyze the reason behind this phenomenon, and propose a novel irregular patch embedding module and adaptive patch fusion module to improve the performance. We conjecture that the vision transformer blocks (which consist of multi-head attention and a feed-forward network) are better suited to handling high-level information than low-level features. The irregular patch embedding module extracts patches that contain rich high-level information with different receptive fields. The transformer blocks can obtain the most useful information from these irregular patches. Then the processed patches pass through the adaptive patch merging module to get the final features for the classifier. With our proposed improvements, the traditional uniform vision transformer structure can achieve…
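The abstract names two ideas: patches taken at different receptive fields, and an adaptive step that merges the processed patches into one feature for the classifier. The following is a minimal sketch of how those two ideas could fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the single-channel feature map, the average-pooling windows standing in for "different receptive fields", the hypothetical `score_weights` vector, and the softmax-weighted sum standing in for adaptive fusion. The transformer blocks between the two steps are omitted.

```python
import math

def pool_patches(feature_map, window):
    """Average-pool non-overlapping window x window regions (one value per region)."""
    height, width = len(feature_map), len(feature_map[0])
    patches = []
    for top in range(0, height - window + 1, window):
        for left in range(0, width - window + 1, window):
            cells = [feature_map[r][c]
                     for r in range(top, top + window)
                     for c in range(left, left + window)]
            patches.append(sum(cells) / len(cells))
    return patches

def multi_scale_tokens(feature_map, windows=(1, 2, 4)):
    """Assumption: mimic 'irregular' patches by mixing several window sizes."""
    tokens = []
    for window in windows:
        # Each token is a 1-element vector to keep the sketch small.
        tokens.extend([value] for value in pool_patches(feature_map, window))
    return tokens

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_fusion(tokens, score_weights):
    """Assumption: fuse tokens with softmax weights from a hypothetical scoring vector."""
    scores = [sum(t * w for t, w in zip(token, score_weights)) for token in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(wt * token[d] for wt, token in zip(weights, tokens))
            for d in range(dim)]

# Usage: a 4x4 feature map yields 16 + 4 + 1 = 21 tokens across three window sizes.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = multi_scale_tokens(feature_map)
fused = adaptive_fusion(tokens, score_weights=[0.5])
print(len(tokens), fused)
```

With an all-zero scoring vector the fusion reduces to plain average pooling over tokens, so in this sketch the "adaptive" part is whatever the scoring vector adds beyond uniform averaging.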

Taxonomy

Topics: Advanced Memory and Neural Computing · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Attention Dropout · Dense Connections · Feedforward Network · Softmax · Residual Connection · Vision Transformer · Multi-Head Attention