On the Connection between Local Attention and Dynamic Depth-wise Convolution
Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, Jingdong Wang

TL;DR
This paper explores the relationship between local attention in Vision Transformers and depth-wise convolution, revealing their similarities and differences, and demonstrating that depth-wise convolution can achieve comparable performance with lower complexity.
Contribution
It provides a novel analysis connecting local attention to depth-wise convolution, highlighting the role of weight sharing and dynamic weights in model performance.
Findings
Depth-wise convolution performs on par with local attention in vision tasks.
Dynamic depth-wise convolution can outperform standard local attention models.
Local attention benefits from regularization forms similar to depth-wise convolution.
Abstract
Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and the variant, Local Vision Transformer, makes further improvements. The major component in Local Vision Transformer, local attention, performs the attention separately over small local windows. We rephrase local attention as a channel-wise locally-connected layer and analyze it from two network regularization manners, sparse connectivity and weight sharing, as well as weight computation. Sparse connectivity: there is no connection across channels, and each position is connected to the positions within a small local window. Weight sharing: the connection weights for one position are shared across channels or within each group of channels. Dynamic weight: the connection weights are dynamically predicted according to each image instance. We point out that local attention resembles depth-wise convolution…
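The three properties named in the abstract (sparse connectivity, weight sharing, dynamic weight) can be made concrete with a small 1-D sketch. This is an illustration under stated assumptions, not the authors' implementation: it uses NumPy, a sliding window rather than the partitioned windows of Local Vision Transformer, and identity query/key/value projections. The function names `depthwise_conv1d` and `local_attention1d` are hypothetical.

```python
import numpy as np

def depthwise_conv1d(x, w):
    """Depth-wise convolution over a sequence.

    x: (N, C) features. w: (K, C) kernel with odd K.
    Weights are static, differ per channel, and are shared across positions.
    """
    x = np.asarray(x, dtype=float)
    n, _ = x.shape
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))       # zero-pad along positions only
    out = np.zeros_like(x)
    for i in range(n):
        window = xp[i:i + k]                   # (K, C) local neighbourhood
        out[i] = (w * window).sum(axis=0)      # no mixing across channels
    return out

def local_attention1d(x, k):
    """Sliding-window self-attention (assumed identity Q/K/V projections).

    Weights are predicted from the input (dynamic) and one weight per
    neighbour is shared across all channels.
    """
    x = np.asarray(x, dtype=float)
    n, c = x.shape
    pad = k // 2
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - pad), min(n, i + pad + 1)
        window = x[lo:hi]                      # neighbours inside the window
        logits = window @ x[i] / np.sqrt(c)    # similarity to the query position
        a = np.exp(logits - logits.max())
        a /= a.sum()                           # softmax over the window
        out[i] = a @ window                    # same weights applied to every channel
    return out

x = np.arange(12, dtype=float).reshape(6, 2)
y_conv = depthwise_conv1d(x, np.full((3, 2), 1.0 / 3.0))
y_attn = local_attention1d(x, 3)
```

Both functions aggregate each channel over a small neighbourhood without connecting channels; they differ in where the aggregation weights come from (a learned static kernel versus input-dependent attention scores) and in how those weights are shared (across positions versus across channels).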
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Brain Tumor Detection and Classification · Advanced Memory and Neural Computing
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Stochastic Depth · Swin Transformer · Adam · Vision Transformer · Label Smoothing · Convolution
