Demystify Transformers & Convolutions in Modern Image Deep Networks

Xiaowei Hu; Min Shi; Weiyun Wang; Sitong Wu; Linjie Xing; Wenhai Wang; Xizhou Zhu; Lewei Lu; Jie Zhou; Xiaogang Wang; Yu Qiao; Jifeng Dai

arXiv:2211.05781 · cs.CV · June 23, 2025 · 1 citation

TL;DR

This paper investigates the true contributions of convolution and attention modules in modern image networks, highlighting the importance of network architecture and spatial feature aggregation, and providing a unified framework for fair comparison.

Contribution

The paper introduces a unified architecture for comparing spatial token mixers (STMs) impartially, separating the gains of the feature transformation modules themselves from those of network-level and block-level design.

Findings

01. Advanced network-level and block-level designs significantly boost performance.

02. Differences among STMs persist even under a unified architecture.

03. The analysis offers insights into the receptive fields, invariance, and robustness of STMs.

Abstract

Vision transformers have gained popularity recently, leading to the development of new vision backbones with improved features and consistent performance gains. However, these advancements are not solely attributable to novel feature transformation designs; certain benefits also arise from advanced network-level and block-level architectures. This paper aims to identify the real gains of popular convolution and attention operators through a detailed study. We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach, known as the "spatial token mixer" (STM). To facilitate an impartial comparison, we introduce a unified architecture to neutralize the impact of divergent network-level and block-level designs. Subsequently, various STMs are integrated into this unified framework for…
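The idea of holding the block fixed and swapping only the spatial token mixer can be illustrated with a toy sketch. This is not the paper's architecture or the code in the official repository: the names `local_mixer`, `global_attention_mixer`, and `unified_block` are hypothetical, tokens form a 1D sequence of plain Python lists, and the mixers have no learned projections, normalization, or feed-forward layers. It only shows how a convolution-like mixer aggregates over a fixed local window while an attention-like mixer aggregates over all tokens.

```python
import math
from typing import Callable, List

Tokens = List[List[float]]  # a sequence of tokens, each a small feature vector


def local_mixer(tokens: Tokens, radius: int = 1) -> Tokens:
    """Convolution-like STM (assumed, unlearned): average over a fixed local window."""
    out = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - radius): i + radius + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def global_attention_mixer(tokens: Tokens) -> Tokens:
    """Attention-like STM (assumed, no Q/K/V projections): softmax-weighted sum over all tokens."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        # Scaled dot-product scores between this token and every token.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) / total for d in range(dim)])
    return out


def unified_block(tokens: Tokens, mixer: Callable[[Tokens], Tokens]) -> Tokens:
    """Shared block design: a residual connection around whichever mixer is plugged in."""
    mixed = mixer(tokens)
    return [[x + y for x, y in zip(tok, mix)] for tok, mix in zip(tokens, mixed)]


# Only the first two tokens carry signal; the last token is far from both.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
local_out = unified_block(tokens, local_mixer)
global_out = unified_block(tokens, global_attention_mixer)
```

In this toy setup the last token stays at zero under the local mixer, because its window never reaches the informative tokens, but becomes nonzero under the global mixer. Because the surrounding block is identical in both cases, any difference in output comes from the mixer alone, which is the kind of controlled comparison the abstract describes.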

Code & Models

Repositories

opengvlab/stm-evaluation (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Image Processing Techniques and Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax