ExMobileViT: Lightweight Classifier Extension for Mobile Vision Transformer
Gyeongdong Yang, Yungwook Kwon, and Hyunjin Kim

TL;DR
ExMobileViT introduces a lightweight extension to mobile vision transformers that reuses early attention stage information via average pooling, significantly improving accuracy with minimal additional computational cost.
Contribution
The paper presents a novel method to enhance mobile vision transformers by leveraging early attention features, improving performance with negligible overhead.
Findings
Notable accuracy improvements over MobileViT on ImageNet
Only about 5% increase in parameters
Minimal additional computational overhead
Abstract
The paper proposes an efficient structure for enhancing the performance of a mobile-friendly vision transformer with small computational overhead. The vision transformer (ViT) is attractive because it outperforms conventional convolutional neural networks (CNNs) in image classification. Because ViT requires substantial computational resources, MobileNet-based ViT models such as MobileViT-S have been developed. However, their performance does not reach that of the original ViT model. The proposed structure mitigates this weakness by storing the information from early attention stages and reusing it in the final classifier. The paper is motivated by the idea that the data from early attention stages can itself carry important meaning for the final classification. In order to reuse the early information in attention stages, the average pooling results of various scaled…
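The reuse idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the excerpt does not say which stages are tapped, how many channels they have, or how the pooled values are combined, so the concatenation step, the channel counts, the spatial sizes, and every function name below (`global_avg_pool`, `linear`, `classify_with_early_features`, `make_map`) are assumptions chosen only for illustration.

```python
import random


def global_avg_pool(feature_map):
    """Reduce a feature map (list of H x W channels) to one mean per channel."""
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled


def linear(x, weights, bias):
    """Plain fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def classify_with_early_features(stage_maps, weights, bias):
    """Pool every stage output, concatenate the pooled vectors (an assumed
    way of combining them), and feed the result to the final classifier."""
    pooled = []
    for fmap in stage_maps:
        pooled.extend(global_avg_pool(fmap))
    return linear(pooled, weights, bias)


def make_map(channels, size, rng):
    """Hypothetical stand-in for a stage output: channels x size x size."""
    return [[[rng.random() for _ in range(size)] for _ in range(size)]
            for _ in range(channels)]


rng = random.Random(0)

# Three hypothetical stage outputs at decreasing resolution.
# Channel counts and sizes are illustrative, not taken from the paper.
stage_maps = [make_map(4, 8, rng), make_map(6, 4, rng), make_map(8, 2, rng)]

num_features = sum(len(fmap) for fmap in stage_maps)  # 4 + 6 + 8 channels
num_classes = 3
weights = [[rng.uniform(-1.0, 1.0) for _ in range(num_features)]
           for _ in range(num_classes)]
bias = [0.0] * num_classes

logits = classify_with_early_features(stage_maps, weights, bias)
print(len(logits), "logits from", num_features, "pooled features")
```

Because each stage contributes only one value per channel after pooling, the extra cost in this sketch is limited to the additional input columns of the classifier's weight matrix, which is in line with the small overhead the summary reports.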
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Image Enhancement Techniques
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Residual Connection · Average Pooling · Layer Normalization · Dense Connections · Vision Transformer
