MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models
Chenglin Yang, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan Yuille, Hartwig Adam, Liang-Chieh Chen

TL;DR
MOAT is a family of vision networks whose basic block merges mobile convolution (inverted residual blocks) with self-attention. It reaches 89.1% top-1 accuracy on ImageNet-1K with ImageNet-22K pretraining and transfers to downstream tasks such as detection and segmentation.
Contribution
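The paper proposes the MOAT block, which merges mobile convolution and attention into a single block instead of stacking separate mobile convolution and transformer blocks. Starting from a standard Transformer block, the multi-layer perceptron is replaced with a mobile convolution block, which is then moved before the self-attention operation.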
Findings
Achieves 89.1% top-1 accuracy on ImageNet-1K (81.5% on ImageNet-1K-V2) with ImageNet-22K pretraining
Surpasses several mobile transformer models on ImageNet
Effective for downstream tasks like detection and segmentation
Abstract
This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike the current works that stack separate mobile convolution and transformer blocks, we effectively merge them into a MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation. The mobile convolution block not only enhances the network representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% / 81.5% top-1 accuracy on ImageNet-1K / ImageNet-1K-V2 with ImageNet22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large resolution inputs by simply converting the global attention to…
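The block ordering described in the abstract (a mobile convolution block in place of the MLP, placed before self-attention) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes single-head attention, an expansion ratio of 4, and GELU activations, and it omits normalization layers, downsampling, and any positional encoding. The names `mbconv`, `self_attention`, `moat_block`, and `init_params` are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv3x3(x, w):
    # x: (H, W, C), w: (3, 3, C); stride 1, zero padding, one filter per channel
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W, :] * w[i, j, :]
    return out

def mbconv(x, p):
    # Inverted residual: 1x1 expand -> 3x3 depthwise -> 1x1 project, plus skip.
    h = gelu(x @ p["expand"])                 # (H, W, C) -> (H, W, 4C)
    h = gelu(depthwise_conv3x3(h, p["dw"]))   # spatial mixing per channel
    h = h @ p["project"]                      # (H, W, 4C) -> (H, W, C)
    return x + h

def self_attention(x, p):
    # Global single-head attention over all H*W positions, plus skip.
    H, W, C = x.shape
    t = x.reshape(H * W, C)
    q, k, v = t @ p["wq"], t @ p["wk"], t @ p["wv"]
    s = q @ k.T / np.sqrt(C)
    s = s - s.max(axis=-1, keepdims=True)     # numerical stability
    a = np.exp(s)
    a = a / a.sum(axis=-1, keepdims=True)
    out = (a @ v) @ p["wo"]
    return x + out.reshape(H, W, C)

def moat_block(x, p):
    # Mobile convolution first, then self-attention; no separate MLP.
    return self_attention(mbconv(x, p), p)

def init_params(rng, c, expansion=4):
    # Hypothetical random initialisation, for illustration only.
    s = 0.1
    return {
        "expand": s * rng.standard_normal((c, expansion * c)),
        "dw": s * rng.standard_normal((3, 3, expansion * c)),
        "project": s * rng.standard_normal((expansion * c, c)),
        "wq": s * rng.standard_normal((c, c)),
        "wk": s * rng.standard_normal((c, c)),
        "wv": s * rng.standard_normal((c, c)),
        "wo": s * rng.standard_normal((c, c)),
    }

rng = np.random.default_rng(0)
params = init_params(rng, c=8)
x = rng.standard_normal((4, 4, 8))   # one 4x4 feature map with 8 channels
y = moat_block(x, params)            # output has the same shape as the input
```

Because both sub-blocks are residual, the output keeps the input's shape, so blocks of this form can be chained directly.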
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Softmax · Byte Pair Encoding · Convolution · Adam · Dense Connections
