A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP
Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha

TL;DR
This paper empirically compares CNN, Transformer, and MLP architectures for image classification within a unified framework, revealing their strengths and differences at various scales, and proposes hybrid models that achieve competitive performance.
Contribution
It introduces the SPACH framework for fair comparison of DNN structures and proposes hybrid models combining convolution and Transformer modules with state-of-the-art accuracy.
Findings
All structures perform competitively at moderate scale.
Distinct behaviors emerge as network size increases.
Hybrid models can match state-of-the-art performance.
Abstract
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The…
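The abstract's idea of separate modules for spatial and channel processing can be illustrated with a small sketch. This is not the authors' SPACH implementation. Every name here (`mlp_spatial`, `conv_spatial`, `attention_spatial`, `channel_mlp`, `mixing_block`) is hypothetical, the fixed arithmetic stands in for learned weights, and normalization layers, multi-stage design, and 2-D image structure are all omitted. It only shows how one block layout can host three interchangeable spatial-mixing choices while the channel-mixing step stays the same.

```python
import math
from typing import Callable, List

Tokens = List[List[float]]  # N tokens (spatial positions), each with C channels


def mlp_spatial(x: Tokens) -> Tokens:
    """MLP-style spatial mixing: each token gets a fixed weighted sum of all tokens.
    Uniform weights stand in for a learned N x N mixing matrix (assumption)."""
    n, c = len(x), len(x[0])
    mean = [sum(tok[j] for tok in x) / n for j in range(c)]
    return [mean[:] for _ in range(n)]


def conv_spatial(x: Tokens) -> Tokens:
    """Convolution-style spatial mixing: each token sees only a local window.
    A 1-D window of size 3 with averaging stands in for a learned kernel."""
    n, c = len(x), len(x[0])
    out = []
    for i in range(n):
        window = x[max(0, i - 1): i + 2]  # in-bounds neighbours only
        out.append([sum(tok[j] for tok in window) / len(window) for j in range(c)])
    return out


def attention_spatial(x: Tokens) -> Tokens:
    """Attention-style spatial mixing: weights depend on the input itself.
    Query/key/value projections are taken as identity for brevity."""
    c = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(c) for k in x]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        out.append([sum(e / total * k[j] for e, k in zip(exps, x)) for j in range(c)])
    return out


def channel_mlp(x: Tokens) -> Tokens:
    """Channel mixing: applied to every token independently, across its channels.
    ReLU of mean-centred channels stands in for a learned two-layer MLP."""
    out = []
    for tok in x:
        mu = sum(tok) / len(tok)
        out.append([max(0.0, v - mu) for v in tok])
    return out


def _add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum of two token grids of the same shape."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def mixing_block(x: Tokens, spatial: Callable[[Tokens], Tokens]) -> Tokens:
    """One block: residual spatial mixing followed by residual channel mixing."""
    x = _add(x, spatial(x))
    return _add(x, channel_mlp(x))


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    for mixer in (mlp_spatial, conv_spatial, attention_spatial):
        print(mixer.__name__, mixing_block(tokens, mixer))
```

Swapping the `spatial` argument is the only change needed to move between the three structure families in this sketch, which is the kind of controlled comparison the abstract describes.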
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Average Pooling · Dense Connections · Global Average Pooling · Byte Pair Encoding
