Mobile-Former: Bridging MobileNet and Transformer
Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, Zicheng Liu

TL;DR
Mobile-Former introduces a parallel MobileNet and transformer architecture with a bidirectional bridge, enabling efficient global and local feature fusion, leading to superior performance on vision tasks with reduced computation.
Contribution
The paper proposes a novel Mobile-Former architecture that combines MobileNet and transformer with a lightweight cross attention bridge, enhancing efficiency and representation power.
Findings
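Outperforms MobileNetV3 in the low-FLOP regime (25M to 500M FLOPs) on ImageNet classification, e.g. 77.9% top-1 accuracy at 294M FLOPs, a 1.3% gain over MobileNetV3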
Achieves higher accuracy in object detection tasks
Reduces computational cost in end-to-end detection models
Abstract
We present Mobile-Former, a parallel design of MobileNet and transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and transformer at global interaction. And the bridge enables bidirectional fusion of local and global features. Different from recent works on vision transformer, the transformer in Mobile-Former contains very few tokens (e.g. 6 or fewer tokens) that are randomly initialized to learn global priors, resulting in low computational cost. Combining with the proposed light-weight cross attention to model the bridge, Mobile-Former is not only computationally efficient, but also has more representation power. It outperforms MobileNetV3 at low FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, Mobile-Former achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 but…
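The two-way bridge described in the abstract can be illustrated with a minimal single-head NumPy sketch. The abstract does not give the layer details, so everything here is an assumption made for illustration: the function names (`mobile_to_former`, `former_to_mobile`), the choice to apply learned projections only on the token side while using the local features directly, the residual connections, and the sizes (6 tokens as in the abstract; token width 192, 16 channels, and a 14x14 feature map are arbitrary). It is not the authors' implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def mobile_to_former(tokens, feat, w_q, w_o):
    """Local -> global: tokens query the local feature map.

    tokens: (M, d) global tokens; feat: (HW, C) flattened local features.
    Assumption: only the token side is projected (w_q: d->C, w_o: C->d);
    the features serve directly as keys and values.
    """
    q = tokens @ w_q                                       # (M, C)
    attn = softmax(q @ feat.T / np.sqrt(feat.shape[1]))    # (M, HW)
    return tokens + (attn @ feat) @ w_o                    # (M, d), residual

def former_to_mobile(feat, tokens, w_k, w_v):
    """Global -> local: each spatial position queries the tokens.

    Assumption: keys and values are projected tokens (w_k, w_v: d->C);
    the features serve directly as queries.
    """
    k = tokens @ w_k                                       # (M, C)
    v = tokens @ w_v                                       # (M, C)
    attn = softmax(feat @ k.T / np.sqrt(feat.shape[1]))    # (HW, M)
    return feat + attn @ v                                 # (HW, C), residual

# Illustrative sizes only: M = 6 tokens follows the abstract; the rest is arbitrary.
rng = np.random.default_rng(0)
M, d, C, H, W = 6, 192, 16, 14, 14
tokens = rng.standard_normal((M, d))       # randomly initialized global tokens
feat = rng.standard_normal((H * W, C))     # stand-in for a MobileNet feature map
w_q, w_k, w_v = (0.02 * rng.standard_normal((d, C)) for _ in range(3))
w_o = 0.02 * rng.standard_normal((C, d))

new_tokens = mobile_to_former(tokens, feat, w_q, w_o)
new_feat = former_to_mobile(feat, new_tokens, w_k, w_v)
print(new_tokens.shape, new_feat.shape)
```

With these sizes each bridge direction builds an attention map of 6 x 196 = 1,176 entries, whereas self-attention over all 196 spatial positions would need 196 x 196 = 38,416. This is one way to see why keeping the token count very small keeps the global branch cheap.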
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Feature Pyramid Network · Focal Loss · RetinaNet · Feedforward Network · Detection Transformer · Absolute Position Encodings
