ELSA: Enhanced Local Self-Attention for Vision Transformer

Jingkai Zhou; Pichao Wang; Fan Wang; Qiong Liu; Hao Li; Rong Jin

arXiv:2112.12786 · cs.CV · December 24, 2021

TL;DR

ELSA enhances local self-attention in vision transformers by introducing Hadamard attention and the ghost head, improving classification accuracy and downstream-task performance without changing the host architecture.

Contribution

This paper proposes ELSA, a local self-attention method that addresses the limitations of existing local self-attention (LSA) by incorporating Hadamard attention and the ghost head, boosting vision transformer performance.

Findings

01. ELSA improves Swin Transformer top-1 accuracy by up to 1.4 percentage points.

02. ELSA-VOLO-D5 reaches 87.2% top-1 accuracy on ImageNet-1K.

03. ELSA also improves downstream tasks such as object detection and segmentation.

Abstract

Self-attention is powerful in modeling long-range dependencies, but it is weak in local finer-level feature learning. The performance of local self-attention (LSA) is just on par with convolution and inferior to dynamic filters, which puzzles researchers on whether to use LSA or its counterparts, which one is better, and what makes LSA mediocre. To clarify these, we comprehensively investigate LSA and its counterparts from two sides: channel setting and spatial processing. We find that the devil lies in the generation and application of spatial attention, where relative position embeddings and the neighboring filter application are key factors. Based on these findings, we propose the enhanced local self-attention (ELSA) with Hadamard attention and the ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention in the neighboring case,…
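
For orientation, the plain local self-attention (LSA) baseline that the abstract discusses can be sketched as follows. This is a minimal single-head illustration, not ELSA itself: it uses ordinary scaled dot-product attention restricted to a square neighborhood, and it leaves out the relative position embeddings, Hadamard attention, and ghost head named above. The function name `local_self_attention` and the `window` parameter are assumptions made for this example and are not taken from the paper or its repository.

```python
import math


def local_self_attention(q, k, v, window=3):
    """Single-head local self-attention on an H x W grid of C-dim vectors.

    q, k, v are nested lists of shape [H][W][C]. Each output position attends
    only to the window x window neighborhood around it (clipped at borders).
    Hypothetical helper for illustration; relative position terms are omitted.
    """
    h, w, c = len(q), len(q[0]), len(q[0][0])
    r = window // 2
    scale = 1.0 / math.sqrt(c)
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Collect the in-bounds neighbors of position (i, j).
            nbrs = [(y, x)
                    for y in range(max(0, i - r), min(h, i + r + 1))
                    for x in range(max(0, j - r), min(w, j + r + 1))]
            # Scaled dot-product logits between the query and each neighbor key.
            logits = [scale * sum(a * b for a, b in zip(q[i][j], k[y][x]))
                      for y, x in nbrs]
            # Numerically stable softmax over the neighborhood.
            m = max(logits)
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            weights = [e / z for e in exps]
            # Attention-weighted sum of the neighbor values.
            out[i][j] = [sum(wt * v[y][x][ch] for wt, (y, x) in zip(weights, nbrs))
                         for ch in range(c)]
    return out


# Usage: a 4 x 4 feature map with 2 channels, attending over 3 x 3 neighborhoods.
feat = [[[float(i + j), 1.0] for j in range(4)] for i in range(4)]
result = local_self_attention(feat, feat, feat, window=3)
```

Because the attention weights at each position are non-negative and sum to one, every output vector is a convex combination of the values in its neighborhood; with `window=1` the function returns the values unchanged.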

Code & Models

Repositories

damo-cv/elsa (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Absolute Position Encodings · Residual Connection · Dropout