Learning Correlation Structures for Vision Transformers

Manjin Kim; Paul Hongsuck Seo; Cordelia Schmid; Minsu Cho

arXiv:2404.03924 · cs.CV · April 8, 2024 · 2 cites

Open Access

TL;DR

This paper proposes structural self-attention (StructSA), an attention mechanism that recognizes correlation patterns in key-query interactions so that vision transformers can better model structural information in images and videos. A transformer built on it, StructViT, reports state-of-the-art results on image and video classification benchmarks.

Contribution

The paper introduces StructSA, an attention mechanism that exploits structural patterns in key-query correlations, and uses it as the main building block of the structural vision transformer (StructViT).

Findings

01. Achieved state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.

02. Demonstrated the effectiveness of StructSA in capturing scene layouts, object motion, and inter-object relations.

03. Improved accuracy over existing attention mechanisms in vision transformers.

Abstract

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
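The mechanism described in the abstract can be illustrated with a minimal 1D sketch in NumPy. This is not the authors' implementation: it assumes a single head, a 1D token sequence instead of a space-time grid, and zero padding at the borders. The names `struct_sa_1d`, `w_struct` (a convolution kernel applied along the key axis of each query's correlation map), and `w_ctx` (a projection from structure channels to local context weights) are hypothetical and chosen only for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def struct_sa_1d(q, k, v, w_struct, w_ctx):
    """Simplified structural self-attention over a 1D token sequence.

    q, k, v  : (N, d) query, key, and value features
    w_struct : (M, D) hypothetical conv kernel over the key axis, M odd
    w_ctx    : (D, M) hypothetical projection to M local context weights
    returns  : (N, d)
    """
    n, d = q.shape
    m = w_struct.shape[0]
    pad = m // 2

    # Row i is the key-query correlation map of query i.
    corr = q @ k.T / np.sqrt(d)                         # (N, N)

    # Convolve each correlation map along the key axis:
    # windows[i, j] holds the M correlations around key j.
    corr_pad = np.pad(corr, ((0, 0), (pad, pad)))
    windows = sliding_window_view(corr_pad, m, axis=1)  # (N, N, M)
    struct = windows @ w_struct                         # (N, N, D)

    # Turn structure features into weights over the M neighbours of
    # each key, normalised jointly over all keys and neighbours.
    logits = struct @ w_ctx                             # (N, N, M)
    attn = softmax(logits.reshape(n, -1)).reshape(n, n, m)

    # Aggregate local contexts of the value features.
    v_pad = np.pad(v, ((pad, pad), (0, 0)))
    v_windows = sliding_window_view(v_pad, m, axis=0)   # (N, d, M)
    return np.einsum("ijm,jdm->id", attn, v_windows)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = struct_sa_1d(q, k, v, rng.normal(size=(3, 2)), rng.normal(size=(2, 3)))
```

Under these assumptions, setting `M = 1` and `D = 1` with both weights equal to one reduces the sketch to standard scaled dot-product attention, which is a convenient sanity check on the design.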

Taxonomy

Topics: Neural Networks and Applications · Advanced Vision and Imaging

Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Layer Normalization · Dense Connections · Vision Transformer · Convolution