On the Connection between Local Attention and Dynamic Depth-wise Convolution

Qi Han; Zejia Fan; Qi Dai; Lei Sun; Ming-Ming Cheng; Jiaying Liu; Jingdong Wang

arXiv:2106.04263 · cs.CV · August 5, 2022 · 71 cites

TL;DR

This paper explores the relationship between local attention in Vision Transformers and depth-wise convolution, revealing their similarities and differences, and demonstrating that depth-wise convolution can achieve comparable performance with lower complexity.

Contribution

It provides a novel analysis connecting local attention to depth-wise convolution, highlighting the role of weight sharing and dynamic weights in model performance.

Findings

01. Depth-wise convolution performs on par with local attention in vision tasks.

02. Dynamic depth-wise convolution can outperform standard local attention models.

03. Local attention benefits from regularization forms similar to depth-wise convolution.

Abstract

Vision Transformer (ViT) attains state-of-the-art performance in visual recognition, and the variant, Local Vision Transformer, makes further improvements. The major component in Local Vision Transformer, local attention, performs the attention separately over small local windows. We rephrase local attention as a channel-wise locally-connected layer and analyze it from two network regularization manners, sparse connectivity and weight sharing, as well as weight computation. Sparse connectivity: there is no connection across channels, and each position is connected to the positions within a small local window. Weight sharing: the connection weights for one position are shared across channels or within each group of channels. Dynamic weight: the connection weights are dynamically predicted according to each image instance. We point out that local attention resembles depth-wise convolution…
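The three properties named in the abstract can be made concrete with a small one-dimensional sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the window slides around each position (window-based models may instead use fixed, non-overlapping windows), the attention uses a single head with identity query, key, and value projections, and out-of-range positions are simply skipped.

```python
import math

def window(n, i, k):
    """Indices of the k positions centred on i, clipped to [0, n)."""
    half = k // 2
    return [j for j in range(i - half, i + half + 1) if 0 <= j < n]

def depthwise_conv_1d(x, w):
    """Depth-wise convolution over a sequence.

    x: list of N positions, each a list of C channel values.
    w[c][t]: static weight for channel c at window offset t (odd kernel size).
    """
    n, c, k = len(x), len(x[0]), len(w[0])
    half = k // 2
    out = []
    for i in range(n):
        row = []
        for ch in range(c):
            # Sparse connectivity: channel ch only sees channel ch, and only
            # inside the window. The weights are shared across positions i
            # and do not depend on the input (static).
            row.append(sum(w[ch][j - i + half] * x[j][ch]
                           for j in window(n, i, k)))
        out.append(row)
    return out

def local_attention_1d(x, k):
    """Single-head local attention with identity projections (assumption)."""
    n, c = len(x), len(x[0])
    scale = 1.0 / math.sqrt(c)
    out = []
    for i in range(n):
        idx = window(n, i, k)
        # Dynamic weights: computed from the input itself via a softmax over
        # scaled dot products between position i and its window.
        scores = [scale * sum(a * b for a, b in zip(x[i], x[j])) for j in idx]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # Weight sharing: the same attn weights are applied to every channel,
        # and channels are still never mixed.
        out.append([sum(a * x[j][ch] for a, j in zip(attn, idx))
                    for ch in range(c)])
    return out
```

Both functions aggregate each channel separately over a small window. They differ in where the weights come from (learned parameters versus a softmax over the input) and in what the weights are shared across (positions versus channels), which is the contrast the abstract draws.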

Code & Models

Repositories

atten4vis/demystifylocalvit (PyTorch, official)

Videos

On the Connection between Local Attention and Dynamic Depth-wise Convolution · SlidesLive

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Brain Tumor Detection and Classification · Advanced Memory and Neural Computing

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Stochastic Depth · Swin Transformer · Adam · Vision Transformer · Label Smoothing · Convolution