Exploring and Improving Mobile Level Vision Transformers
Pengguang Chen, Yixin Chen, Shu Liu, Mingchang Yang, Jiaya Jia

TL;DR
This paper investigates why vision transformers lose accuracy at the mobile level, introduces an irregular patch embedding module and an adaptive patch fusion module to address the drop, and achieves state-of-the-art results on mobile vision tasks.
Contribution
The paper proposes new irregular patch embedding and adaptive patch fusion modules to enhance mobile-level vision transformers, significantly improving their performance.
Findings
Improved DeiT baseline by over 9% in mobile settings
Surpassed Swin and CoaT architectures in mobile vision tasks
Achieved state-of-the-art results with the proposed modules
Abstract
We study the vision transformer structure at the mobile level in this paper, and find a dramatic performance drop. We analyze the reason behind this phenomenon, and propose a novel irregular patch embedding module and adaptive patch fusion module to improve the performance. We conjecture that the vision transformer blocks (which consist of multi-head attention and feed-forward network) are more suitable for handling high-level information than low-level features. The irregular patch embedding module extracts patches that contain rich high-level information with different receptive fields. The transformer blocks can obtain the most useful information from these irregular patches. Then the processed patches pass through the adaptive patch merging module to get the final features for the classifier. With our proposed improvements, the traditional uniform vision transformer structure can achieve…
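The abstract describes a pipeline of multi-scale patch extraction followed by a learned fusion of the processed patches, but gives no implementation details. The following is only an illustrative sketch of that general idea, not the authors' modules: the function names (`pool_patches`, `multi_scale_tokens`, `fuse_tokens`), the `[mean, max]` token features, the window sizes, and the softmax-weighted fusion are all assumptions chosen for readability, and the transformer blocks that would sit between the two steps are omitted.

```python
import math

def pool_patches(image, size):
    """Split a square 2D grid into non-overlapping size x size windows and
    summarize each window as a [mean, max] token (hypothetical features)."""
    n = len(image)
    tokens = []
    for r in range(0, n - size + 1, size):
        for c in range(0, n - size + 1, size):
            vals = [image[r + i][c + j] for i in range(size) for j in range(size)]
            tokens.append([sum(vals) / len(vals), max(vals)])
    return tokens

def multi_scale_tokens(image, sizes=(2, 4)):
    """Concatenate tokens from several window sizes, so one sequence mixes
    patches with different receptive fields (assumed stand-in for
    irregular patch embedding)."""
    tokens = []
    for size in sizes:
        tokens.extend(pool_patches(image, size))
    return tokens

def fuse_tokens(tokens, score_weights):
    """Fuse tokens into one vector with softmax weights derived from a
    linear score per token (assumed stand-in for adaptive patch fusion)."""
    scores = [sum(w * x for w, x in zip(score_weights, t)) for t in tokens]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    fused = [sum(wt * t[d] for wt, t in zip(weights, tokens)) for d in range(dim)]
    return fused, weights

# Toy 4x4 "image" with values 0..15; score_weights would be learned in practice.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = multi_scale_tokens(image)
fused, weights = fuse_tokens(tokens, score_weights=[0.1, 0.1])
```

On the toy input, the 2x2 windows yield four tokens and the 4x4 window yields one, so the fused vector is a weighted average over five tokens of mixed scale. In a real model the scoring weights would be trained, and the tokens would first pass through the transformer blocks.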
Taxonomy
Topics: Advanced Memory and Neural Computing · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Attention Dropout · Dense Connections · Feedforward Network · Softmax · Residual Connection · Vision Transformer · Multi-Head Attention
