Mobile-Former: Bridging MobileNet and Transformer

Yinpeng Chen; Xiyang Dai; Dongdong Chen; Mengchen Liu; Xiaoyi Dong; Lu Yuan; Zicheng Liu

arXiv:2108.05895 · cs.CV · March 4, 2022 · 36 cites

TL;DR

Mobile-Former introduces a parallel MobileNet and transformer architecture with a bidirectional bridge, enabling efficient global and local feature fusion, leading to superior performance on vision tasks with reduced computation.

Contribution

The paper proposes a novel Mobile-Former architecture that combines MobileNet and transformer with a lightweight cross attention bridge, enhancing efficiency and representation power.

Findings

01 Outperforms MobileNetV3 in the low-FLOP regime (25M to 500M FLOPs) on ImageNet classification

02 Achieves higher accuracy in object detection tasks

03 Reduces computational cost in end-to-end detection models

Abstract

We present Mobile-Former, a parallel design of MobileNet and transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and of the transformer at global interaction, and the bridge enables bidirectional fusion of local and global features. Unlike recent work on vision transformers, the transformer in Mobile-Former contains very few tokens (e.g. 6 or fewer) that are randomly initialized to learn global priors, resulting in low computational cost. Combined with the proposed lightweight cross attention that models the bridge, Mobile-Former is not only computationally efficient but also has more representation power. It outperforms MobileNetV3 in the low-FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, Mobile-Former achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 but…
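As a rough illustration of the two-way bridge described in the abstract, the sketch below runs cross attention in both directions between a handful of global tokens and a flattened local feature map. It is a minimal NumPy sketch, not the authors' implementation: the single attention head, the choice to project only the query side, the residual connection, and all sizes (6 tokens of width 192, a 14×14 map with 96 channels) are assumptions made for illustration, and the names `cross_attention`, `w_q`, and `w_o` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_o):
    """Single-head cross attention with a residual connection.

    Assumption for this sketch: only the query side is projected
    (w_q maps queries into the context width, w_o maps the attended
    result back), and the context is used directly as keys and values.
    """
    d = context.shape[-1]
    q = queries @ w_q                                 # (Nq, d)
    attn = softmax(q @ context.T / np.sqrt(d))        # (Nq, Nc)
    return queries + (attn @ context) @ w_o           # (Nq, query width)


rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_tokens, token_dim = 6, 192
height, width, channels = 14, 14, 96

tokens = rng.standard_normal((num_tokens, token_dim))        # global tokens
features = rng.standard_normal((height * width, channels))   # local features

# Local -> global: the tokens query the feature map.
w_q_tok = rng.standard_normal((token_dim, channels)) * 0.02
w_o_tok = rng.standard_normal((channels, token_dim)) * 0.02
tokens_out = cross_attention(tokens, features, w_q_tok, w_o_tok)

# Global -> local: every spatial position queries the tokens.
w_q_feat = rng.standard_normal((channels, token_dim)) * 0.02
w_o_feat = rng.standard_normal((token_dim, channels)) * 0.02
features_out = cross_attention(features, tokens_out, w_q_feat, w_o_feat)

print(tokens_out.shape, features_out.shape)
```

With these assumed sizes, the attention matrix in each direction has 6 × 196 entries, whereas self-attention over all 196 spatial positions would need 196 × 196. This is one way to see why keeping the token count very small keeps the cost of the bridge low.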

Taxonomy

Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Feature Pyramid Network · Focal Loss · RetinaNet · Feedforward Network · Detection Transformer · Absolute Position Encodings