MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
Sachin Mehta, Mohammad Rastegari

TL;DR
MobileViT is a lightweight vision transformer that combines the strengths of CNNs and ViTs, achieving high accuracy on mobile vision tasks with few parameters and low latency.
Contribution
The paper introduces MobileViT, a hybrid CNN-transformer architecture that processes global information efficiently on mobile devices and is more accurate than CNN- and ViT-based models of similar size.
Findings
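MobileViT achieves 78.4% top-1 accuracy on ImageNet-1k with about 6 million parameters.
MobileViT is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based), respectively, at a similar parameter count.
On MS-COCO object detection, MobileViT is 5.7% more accurate than MobileNetv3 at a similar parameter count.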
Abstract
Light-weight convolutional neural networks (CNNs) are the de facto standard for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low-latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and…
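The abstract does not spell out how "transformers as convolutions" works. As described in the full paper, a MobileViT block unfolds a feature map into non-overlapping patches, lets a transformer mix information across patches for each within-patch pixel position, and folds the result back into a feature map. The following is a minimal pure-Python sketch of that unfold / mix / fold data flow, not the authors' implementation. The function names (`unfold`, `mix_across_patches`, `fold`) are hypothetical, each pixel holds a single number rather than a d-dimensional vector, and a plain mean stands in for self-attention.

```python
def unfold(feature_map, patch_h, patch_w):
    """Rearrange an H x W grid into P rows of N values.

    P = patch_h * patch_w is the number of pixel positions inside a patch,
    N is the number of non-overlapping patches. Row p collects the pixel at
    within-patch position p from every patch (patches in row-major order).
    """
    H, W = len(feature_map), len(feature_map[0])
    assert H % patch_h == 0 and W % patch_w == 0, "patches must tile the map"
    n_h, n_w = H // patch_h, W // patch_w
    rows = []
    for i in range(patch_h):
        for j in range(patch_w):
            rows.append([feature_map[a * patch_h + i][b * patch_w + j]
                         for a in range(n_h) for b in range(n_w)])
    return rows


def mix_across_patches(rows):
    """Stand-in for the transformer (ASSUMPTION: a mean, not self-attention).

    Each row is processed independently, so a pixel only exchanges
    information with pixels at the same position in the other patches.
    """
    return [[sum(row) / len(row)] * len(row) for row in rows]


def fold(rows, H, W, patch_h, patch_w):
    """Inverse of unfold: put every value back at its spatial location."""
    n_w = W // patch_w
    out = [[None] * W for _ in range(H)]
    for p, row in enumerate(rows):
        i, j = divmod(p, patch_w)      # position inside the patch
        for n, value in enumerate(row):
            a, b = divmod(n, n_w)      # which patch
            out[a * patch_h + i][b * patch_w + j] = value
    return out


# Toy 4 x 4 feature map split into four 2 x 2 patches.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
unfolded = unfold(feature_map, 2, 2)
globally_mixed = fold(mix_across_patches(unfolded), 4, 4, 2, 2)
```

Because `fold` exactly inverts `unfold`, the spatial layout of the feature map survives the global step. In the paper's block, convolutions applied before and after this step supply the local mixing inside each patch, which this sketch leaves out.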
Code & Models
- 🤗 Matthijs/mobilevit-small (model) · 13 dl
- 🤗 Matthijs/deeplabv3-mobilevit-small (model) · 7 dl · ♡ 1
- 🤗 apple/mobilevit-small (model) · 2.6M dl · ♡ 83
- 🤗 apple/mobilevit-x-small (model) · 527 dl · ♡ 8
- 🤗 apple/mobilevit-xx-small (model) · 3.8k dl · ♡ 20
- 🤗 apple/deeplabv3-mobilevit-small (model) · 1.1k dl · ♡ 18
- 🤗 apple/deeplabv3-mobilevit-x-small (model) · 158 dl · ♡ 3
- 🤗 apple/deeplabv3-mobilevit-xx-small (model) · 1.4k dl · ♡ 10
- 🤗 kadirnar/timm_model_list (model) · ♡ 1
- 🤗 timm/mobilevit_s.cvnets_in1k (model) · 30k dl · ♡ 5
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · MobileViT · Linear Layer · Sigmoid Activation · Average Pooling · Depthwise Convolution · Pointwise Convolution · Depthwise Separable Convolution · Squeeze-and-Excitation Block
