Relational Self-Attention: What's Missing in Attention for Video Understanding

Manjin Kim; Heeseung Kwon; Chunyu Wang; Suha Kwak; Minsu Cho

arXiv:2111.01673 · cs.CV · November 3, 2021 · 6 cites

TL;DR

This paper introduces relational self-attention (RSA), a novel dynamic feature transform that captures spatio-temporal relations in videos, significantly improving video understanding and action recognition performance.

Contribution

The paper proposes RSA, a new relational self-attention mechanism that dynamically models spatio-temporal relations, addressing limitations of existing self-attention methods for video analysis.

Findings

01. RSA outperforms convolution and self-attention models on video benchmarks.

02. RSA achieves state-of-the-art results on Something-Something-V1 & V2, Diving48, and FineGym.

03. The results demonstrate the importance of relational modeling in video understanding.

Abstract

Convolution has been arguably the most important feature transform for modern neural networks, leading to the advance of deep learning. Recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed the relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA…
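The contrast the abstract draws between stationary convolution kernels and dynamic feature transforms can be illustrated with a toy sketch. This is not the RSA method and not code from the official repository: it assumes scalar per-frame features, identity query/key/value projections, and hypothetical helper names (`static_conv1d`, `attention_weights`, `self_attention_1d`) chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def static_conv1d(frames, kernel):
    """Stationary transform: the same kernel weights are applied at every
    temporal position, whatever the input content (zero padding at the ends)."""
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(frames):
                acc += w * frames[idx]
        out.append(acc)
    return out


def attention_weights(frames, t):
    """Dynamic weights for position t: computed from query-key products,
    so they change whenever the input content changes.
    Assumption: identity projections, scalar features."""
    query = frames[t]
    return softmax([query * key for key in frames])


def self_attention_1d(frames):
    """Dynamic transform: each output is a content-weighted sum of all frames."""
    out = []
    for t in range(len(frames)):
        weights = attention_weights(frames, t)
        out.append(sum(w * v for w, v in zip(weights, frames)))
    return out


if __name__ == "__main__":
    clip_a = [0.0, 1.0, 2.0]
    clip_b = [2.0, 1.0, 0.0]
    kernel = [1.0, 0.0, -1.0]  # fixed weights, shared by both clips

    print(static_conv1d(clip_a, kernel))
    print(attention_weights(clip_a, 2))  # weights depend on clip_a's content
    print(attention_weights(clip_b, 2))  # different content, different weights
```

In this sketch the convolution kernel is identical for both clips, while the attention weights at the same position differ because they are generated from the input itself. According to the abstract, RSA goes further by generating relational kernels and aggregating relational contexts; that part is not reproduced here.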

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

KimManjin/RSA · PyTorch · Official

Videos

Relational Self-Attention: What's Missing in Attention for Video Understanding · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Dense Connections · Residual Connection · Layer Normalization · Absolute Position Encodings · Convolution · Softmax