Interpret Vision Transformers as ConvNets with Dynamic Convolutions
Chong Zhou, Chen Change Loy, Bo Dai

TL;DR
This paper unifies vision Transformers and ConvNets under a common framework of dynamic convolutions, enabling direct comparison and guiding new network designs that improve efficiency and performance.
Contribution
It introduces a unified interpretation of vision Transformers as ConvNets with dynamic convolutions, facilitating design insights and novel architectures.
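The dynamic-convolution reading can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the paper's formulation: it treats each row of a single-head attention matrix as a kernel generated from the input, spanning every position. The names `dynamic_kernels` and `apply_kernels` are hypothetical, and identity query/key/value projections are assumed for brevity.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_kernels(queries, keys):
    # One kernel per output position, generated from the input itself
    # (scaled dot-product scores followed by softmax).
    scale = 1.0 / math.sqrt(len(queries[0]))
    kernels = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        kernels.append(softmax(scores))
    return kernels

def apply_kernels(kernels, values):
    # Each kernel weights all value vectors, so the receptive field is global.
    dim = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(kernel, values)) for c in range(dim)]
        for kernel in kernels
    ]

# Toy input: 3 positions, 2 channels. Identity projections are an assumption.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = dynamic_kernels(tokens, tokens)
output = apply_kernels(kernels, tokens)
```

In a static convolution the same kernel is reused at every position; here `kernels[0]` and `kernels[1]` differ because they are computed from different query tokens.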
Findings
Replacing softmax in vision Transformers with common ConvNet modules, such as ReLU and Layer Normalization, improves convergence and performance.
Designing depth-wise vision Transformers yields more efficient models with comparable accuracy.
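As a rough illustration of the first finding, the sketch below swaps the softmax over attention scores for ReLU followed by a normalization step, the modules named in the abstract. The exact arrangement used in the paper is not given on this page, so the ordering (ReLU, then a per-row layer normalization without learnable scale or shift) and the name `relu_norm_weights` are assumptions.

```python
import math

def relu(row):
    # Zero out negative scores.
    return [max(0.0, x) for x in row]

def layer_norm(row, eps=1e-5):
    # Normalize one row to zero mean and (approximately) unit variance.
    # No learnable affine parameters in this sketch.
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def relu_norm_weights(scores):
    # Assumed stand-in for softmax: ReLU, then per-row layer normalization.
    return [layer_norm(relu(row)) for row in scores]

# Toy attention scores: 2 query positions, 3 key positions.
scores = [[0.5, -1.0, 2.0], [-0.3, 0.1, 0.0]]
weights = relu_norm_weights(scores)
```

Unlike softmax, the resulting weights are not constrained to be positive or to sum to one; each row is centered at zero instead.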
Abstract
There has been a debate about whether vision Transformers or ConvNets are superior as the backbone of computer vision models. Although they are usually considered two completely different architectures, in this paper we interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework and compare their design choices side by side. In addition, our interpretation can also guide network design, as researchers can now consider vision Transformers from the design space of ConvNets and vice versa. We demonstrate such potential through two specific studies. First, we inspect the role of softmax in vision Transformers as the activation function and find it can be replaced by commonly used ConvNet modules, such as ReLU and Layer Normalization, which results in a faster…
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing
Methods: Attention Is All You Need · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Adam · Linear Layer · Multi-Head Attention · Dropout · Byte Pair Encoding
