MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

Shakti N. Wadekar; Abhishek Chaurasia

arXiv:2209.15159·cs.CV·October 7, 2022·86 cites

TL;DR

MobileViTv3 introduces a simpler, more effective fusion block for mobile-friendly vision transformers, improving accuracy over MobileViTv1 on ImageNet-1K, ADE20K, COCO, and PascalVOC2012 while keeping the models lightweight.

Contribution

The paper proposes a new fusion block for MobileViT that addresses scaling and simplifies the learning task, leading to improved performance over both MobileViTv1 and MobileViTv2.

Findings

01

MobileViTv3 models outperform MobileViTv1 on ImageNet-1K, ADE20K, COCO, and PascalVOC2012.

02

MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9%, respectively, on ImageNet-1K.

03

Adding the fusion block to MobileViTv2 enhances accuracy across datasets.

Abstract

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and has a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. The MobileViTv3-XXS, XS and S models built with our proposed MobileViTv3-block outperform MobileViTv1 on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear complexity transformers to…
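
The abstract does not spell out the fusion rule, so the following is only a minimal sketch of what "fusion of local, global and input features" could look like. It assumes the local and global feature maps are concatenated along the channel axis, mixed by a pointwise (1x1) convolution, and then summed with the block input as a residual. The function names, shapes, and values are hypothetical and are not taken from the authors' code.

```python
# Illustrative sketch only: names, shapes, and the exact fusion rule are
# assumptions, not taken from the authors' implementation.

def pointwise_conv(features, weights):
    """Apply a 1x1 convolution: a per-pixel linear map across channels.

    features: list of pixels, each a list of C_in channel values.
    weights:  C_out rows, each a list of C_in coefficients.
    """
    return [
        [sum(w * x for w, x in zip(row, pixel)) for row in weights]
        for pixel in features
    ]


def fuse(local_feats, global_feats, input_feats, weights):
    """Concatenate local and global channels, mix them with a 1x1 conv,
    then add the block input as a residual (assumed fusion rule)."""
    concatenated = [l + g for l, g in zip(local_feats, global_feats)]
    mixed = pointwise_conv(concatenated, weights)
    return [
        [m + x for m, x in zip(mixed_pixel, input_pixel)]
        for mixed_pixel, input_pixel in zip(mixed, input_feats)
    ]


# Two pixels, two channels per feature map (hypothetical values).
local_feats = [[1.0, 2.0], [3.0, 4.0]]
global_feats = [[0.5, 0.5], [1.0, 1.0]]
input_feats = [[10.0, 20.0], [30.0, 40.0]]
# Maps the 4 concatenated channels back to 2 output channels.
weights = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

fused = fuse(local_feats, global_feats, input_feats, weights)
print(fused)
```

Because a 1x1 convolution has no spatial extent, its weight count in this sketch depends only on the input and output channel counts, which is one plausible reason a pointwise fusion would be easier to scale than a spatial one.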

Taxonomy

Topics: Advanced Neural Network Applications · COVID-19 diagnosis using AI · Brain Tumor Detection and Classification

Methods: MobileViTv3 · MobileViTv2 · Dropout · Pointwise Convolution · Depthwise Separable Convolution · Depthwise Convolution · Batch Normalization · Softmax · Linear Layer · Convolution