A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Yucheng Zhao; Guangting Wang; Chuanxin Tang; Chong Luo; Wenjun Zeng; Zheng-Jun Zha

arXiv:2108.13002·cs.CV·November 29, 2021·57 cites

TL;DR

This paper empirically compares CNN, Transformer, and MLP architectures for image classification within a unified framework, revealing their strengths and differences at various scales, and proposes hybrid models that achieve competitive performance.

Contribution

The paper introduces the SPACH framework for fair comparison of DNN structures and proposes hybrid models that combine convolution and Transformer modules and reach state-of-the-art accuracy.

Findings

1. All structures perform competitively at moderate scale.
2. Distinct behaviors emerge as network size increases.
3. Hybrid models can match state-of-the-art performance.

Abstract

Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The…
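
The idea of separate modules for spatial and channel processing can be illustrated with a small sketch. This is not the SPACH implementation: the function names are hypothetical, a plain linear token-mixing matrix stands in for the spatial module as an assumption, and normalization and nonlinearities are omitted. It only shows that one step mixes information across token positions while the other mixes across channels.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def spatial_mix(x, w_spatial):
    # x has shape (tokens, channels); w_spatial has shape (tokens, tokens).
    # Each output token is a weighted sum of input tokens, so information
    # moves across positions while every channel is handled independently.
    return matmul(w_spatial, x)


def channel_mix(x, w_channel):
    # w_channel has shape (channels, channels).
    # Each token is transformed on its own, so information moves across
    # channels but never across positions.
    return matmul(x, w_channel)


def mixing_block(x, w_spatial, w_channel):
    # Hypothetical block: spatial step then channel step, each wrapped in a
    # residual connection (an assumption made for this illustration).
    x = add(x, spatial_mix(x, w_spatial))
    x = add(x, channel_mix(x, w_channel))
    return x


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0]]   # 2 tokens, 2 channels
    swap = [[0.0, 1.0], [1.0, 0.0]]     # permutation matrix
    print(spatial_mix(tokens, swap))    # rows (tokens) are swapped
    print(channel_mix(tokens, swap))    # columns (channels) are swapped
```

Swapping in a different spatial function (for example a convolution or self-attention) while keeping the channel step fixed is the kind of controlled comparison the abstract describes; the details of how the paper does this are not shown here.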

Code & Models

Repositories

microsoft/SPACH (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Average Pooling · Dense Connections · Global Average Pooling · Byte Pair Encoding