Learning Correlation Structures for Vision Transformers
Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

TL;DR
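This paper proposes structural self-attention (StructSA), an attention mechanism that applies convolution to key-query correlation maps to recognize their structural patterns. This helps vision transformers model scene layouts, object motion, and inter-object relations in images and videos, and the resulting StructViT achieves state-of-the-art results on image and video classification benchmarks.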
Contribution
StructSA, a new attention mechanism that exploits structural patterns in key-query correlations, and StructViT, a vision transformer that uses StructSA as its main building block.
Findings
Achieved state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
Demonstrated effectiveness of StructSA in capturing scene layouts, object motion, and inter-object relations.
Improved accuracy over existing attention mechanisms in vision transformers.
Abstract
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
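The mechanism described in the abstract can be illustrated with a simplified 1-D sketch in NumPy. This is not the authors' implementation: the function name `struct_sa_1d`, the random `kernels` standing in for learned convolution weights, the zero padding, and the choice to normalize with one softmax over all key positions and kernel channels are assumptions made for illustration. The sketch convolves each query's correlation map along the key axis and uses the resulting weights to aggregate a local window of values around each key.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def struct_sa_1d(q, k, v, kernels):
    """Simplified 1-D sketch of structure-aware attention (illustrative only).

    q, k, v : arrays of shape (N, C), one row per token.
    kernels : array of shape (M, M) with M odd. Row d is a length-M filter
              over key positions; its output weights the value at offset
              d - M // 2 from each key. (Hypothetical stand-in for learned weights.)
    Returns the output of shape (N, C) and the attention weights of shape (N, N, M).
    """
    n, c = q.shape
    m = kernels.shape[0]
    if kernels.shape != (m, m) or m % 2 == 0:
        raise ValueError("kernels must be square with an odd side length")
    pad = m // 2

    # Plain scaled dot-product correlation between every query and key.
    corr = q @ k.T / np.sqrt(c)                            # (N, N)

    # Convolve each query's correlation map along the key axis.
    corr_p = np.pad(corr, ((0, 0), (pad, pad)))            # zero padding (assumption)
    windows = sliding_window_view(corr_p, m, axis=1)       # (N, N, M)
    scores = windows @ kernels.T                           # (N, N, M)

    # One softmax per query over all keys and channels (assumption).
    attn = softmax(scores.reshape(n, -1), axis=-1).reshape(n, n, m)

    # Aggregate the local context of values around each key position.
    v_p = np.pad(v, ((pad, pad), (0, 0)))
    v_win = sliding_window_view(v_p, m, axis=0)            # (N, C, M)
    out = np.einsum("ijm,jcm->ic", attn, v_win)            # (N, C)
    return out, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
    out, attn = struct_sa_1d(q, k, v, rng.standard_normal((3, 3)))
    print(out.shape, attn.shape)
```

In this sketch, passing a single 1x1 kernel equal to one (`np.ones((1, 1))`) makes the convolution an identity and the value window a single token, so the function reduces to ordinary scaled dot-product attention. Larger kernels let the attention weights depend on how correlations are arranged around each key rather than on each correlation value alone.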
Taxonomy
Topics: Neural Networks and Applications · Advanced Vision and Imaging
Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Layer Normalization · Dense Connections · Vision Transformer · Convolution
