MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features
Shakti N. Wadekar, Abhishek Chaurasia

TL;DR
MobileViTv3 introduces a simpler, more effective fusion block for mobile-friendly vision transformers, improving accuracy over MobileViTv1 on ImageNet-1K, ADE20K, COCO, and PascalVOC2012 while keeping the models lightweight enough for mobile vision tasks.
Contribution
The paper proposes a new fusion block for MobileViT that addresses the scaling challenges and simplifies the learning task of the MobileViTv1 fusion block. The resulting models outperform MobileViTv1, and adding the block to MobileViTv2 also improves its accuracy.
Findings
MobileViTv3 models outperform MobileViTv1 on ImageNet-1K, ADE20K, COCO, and PascalVOC2012.
MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9%, respectively, on ImageNet-1K.
Adding the fusion block to MobileViTv2 enhances accuracy across datasets.
Abstract
MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create lightweight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and has a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. The MobileViTv3-XXS, XS and S models built from our proposed MobileViTv3-block outperform MobileViTv1 on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear complexity transformers to…
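The abstract does not spell out the internals of the fusion block, so the following is only a minimal sketch of what "fusion of local, global and input features" could look like. It assumes the block concatenates local and global features along the channel axis, mixes them with a pointwise (1x1) convolution, and adds the input features as a residual. The function names, shapes, and weights are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
# Hypothetical sketch: names, shapes, and the exact fusion order are assumptions.
# Features are stored as a list of pixels, each pixel a list of channel values.

def pointwise_conv(pixels, weight):
    """1x1 convolution: apply the same channel-mixing matrix at every pixel.

    pixels: list of per-pixel channel vectors, each of length c_in
    weight: c_out rows, each of length c_in
    """
    return [[sum(w * x for w, x in zip(row, px)) for row in weight]
            for px in pixels]


def fuse(local_feats, global_feats, input_feats, weight):
    """Assumed fusion: concat(local, global) -> 1x1 conv -> add input features."""
    # Channel-wise concatenation at each pixel (list concatenation).
    concatenated = [l + g for l, g in zip(local_feats, global_feats)]
    mixed = pointwise_conv(concatenated, weight)
    # Residual connection with the block's input features.
    return [[m + x for m, x in zip(mixed_px, input_px)]
            for mixed_px, input_px in zip(mixed, input_feats)]


# Toy example: two pixels, two channels each.
input_feats = [[1.0, 2.0], [3.0, 4.0]]
local_feats = [[0.5, 0.5], [1.0, 0.0]]
global_feats = [[0.0, 1.0], [2.0, 2.0]]
# 2 output channels x 4 concatenated channels.
weight = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0]]

fused = fuse(local_feats, global_feats, input_feats, weight)
print(fused)
```

Because a pointwise convolution mixes channels independently at each spatial position, its parameter count in this sketch depends only on the channel counts, not on a spatial kernel size.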
Taxonomy
Topics: Advanced Neural Network Applications · COVID-19 diagnosis using AI · Brain Tumor Detection and Classification
Methods: MobileViTv3 · MobileViTv2 · Dropout · Pointwise Convolution · Depthwise Separable Convolution · Depthwise Convolution · Batch Normalization · Softmax · Linear Layer · Convolution
