Interpret Vision Transformers as ConvNets with Dynamic Convolutions

Chong Zhou; Chen Change Loy; Bo Dai

arXiv:2309.10713·cs.CV·September 20, 2023


TL;DR

This paper unifies vision Transformers and ConvNets under a common framework of dynamic convolutions, enabling direct comparison and guiding new network designs that improve efficiency and performance.

Contribution

The paper interprets vision Transformers as ConvNets with dynamic convolutions, so that design choices from either family can be compared side by side and transferred to the other.
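To make the interpretation concrete, the following is a minimal sketch of self-attention read as a dynamic convolution: the attention weights for each output position act as a kernel that is generated from the input rather than stored as a fixed parameter. It is an illustration under stated assumptions, not the paper's formulation. The helper names (`dynamic_kernels`, `apply_kernels`) are hypothetical, the query, key, and value projections are taken to be the identity, and the tokens are toy values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_kernels(queries, keys, activation=softmax):
    # Hypothetical helper: one kernel per output position, with weights
    # computed from the input itself (scaled dot-product scores).
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [activation([dot(q, k) * scale for k in keys]) for q in queries]

def apply_kernels(kernels, values):
    # Each output token is a weighted sum over all input tokens, i.e. a
    # global convolution whose kernel changes from position to position.
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(kernel, values)) for d in range(dim)]
        for kernel in kernels
    ]

# Toy input: three tokens of dimension two. Assumption: identity projections,
# so the same tokens serve as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = dynamic_kernels(tokens, tokens)
output = apply_kernels(kernels, tokens)
```

A static convolution would reuse one learned kernel for every input; here `kernels` is recomputed whenever `tokens` changes, which is the sense in which the convolution is dynamic.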

Findings

1. Replacing softmax with common ConvNet modules, such as ReLU and Layer Normalization, speeds up convergence and improves performance.

2. Designing depth-wise vision Transformers yields more efficient models with comparable accuracy.

Abstract

There has been a debate over whether vision Transformers or ConvNets are the superior backbone for computer vision models. Although the two are usually considered completely different architectures, in this paper we interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework and compare their design choices side by side. In addition, our interpretation can guide network design, as researchers can now consider vision Transformers from the design space of ConvNets and vice versa. We demonstrate this potential through two specific studies. First, we inspect the role of softmax in vision Transformers as the activation function and find that it can be replaced by commonly used ConvNet modules, such as ReLU and Layer Normalization, which results in a faster…
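The first study treats softmax as an activation function applied to each row of attention scores. The sketch below contrasts softmax with one possible ReLU plus layer-normalization substitute. The abstract is cut off before it describes how the modules are arranged, so the ordering here (ReLU first, then normalization of the row to zero mean and unit variance, without learnable scale or shift) is an assumption, and `relu_layernorm` is a hypothetical name.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_layernorm(scores, eps=1e-5):
    # Assumed arrangement, not confirmed by the source: rectify the scores,
    # then normalize the row to zero mean and unit variance.
    rectified = [max(0.0, s) for s in scores]
    mean = sum(rectified) / len(rectified)
    var = sum((r - mean) ** 2 for r in rectified) / len(rectified)
    return [(r - mean) / math.sqrt(var + eps) for r in rectified]

# Toy row of attention scores for a single query position.
scores = [2.0, -1.0, 0.5, 0.0]
soft = softmax(scores)
alt = relu_layernorm(scores)
```

Both functions keep the largest score as the largest weight. The difference is that softmax produces positive weights summing to one, while the assumed substitute produces a zero-mean row that can contain negative weights.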


Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing

Methods: Attention Is All You Need · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Adam · Linear Layer · Multi-Head Attention · Dropout · Byte Pair Encoding