MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Sachin Mehta; Mohammad Rastegari

arXiv:2110.02178 · cs.CV · March 7, 2022 · 142 cites

TL;DR

MobileViT is a lightweight vision transformer that combines the strengths of CNNs and ViTs, achieving high accuracy on mobile vision tasks with few parameters and low latency.

Contribution

The paper introduces MobileViT, a hybrid CNN-transformer architecture that processes global information efficiently on mobile devices and outperforms comparable CNN- and ViT-based models in accuracy.

Findings

01

MobileViT achieves 78.4% top-1 accuracy on ImageNet-1k.

02

MobileViT outperforms MobileNetv3 and DeIT at a similar number of parameters.

03

On object detection, MobileViT is 5.7% more accurate than MobileNetv3.

Abstract

Light-weight convolutional neural networks (CNNs) are the de facto standard for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low-latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and…

Code & Models

Videos

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · MobileViT · Linear Layer · Sigmoid Activation · Average Pooling · Depthwise Convolution · Pointwise Convolution · Depthwise Separable Convolution · Squeeze-and-Excitation Block