Relational Self-Attention: What's Missing in Attention for Video Understanding
Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho

TL;DR
This paper introduces relational self-attention (RSA), a dynamic feature transform that captures spatio-temporal relations in videos and outperforms convolution and self-attention counterparts on action recognition benchmarks.
Contribution
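The paper proposes RSA, a relational feature transform that dynamically generates relational kernels and aggregates relational contexts. It targets motion information, i.e., correspondence relations in space and time, which existing dynamic transforms, including self-attention, are limited in modeling.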
Findings
RSA outperforms convolution and self-attention models on video benchmarks
Achieves state-of-the-art results on Something-Something-V1 & V2, Diving48, and FineGym
Demonstrates the importance of relational modeling in video understanding
Abstract
Convolution has been arguably the most important feature transform for modern neural networks, leading to the advance of deep learning. Recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed the relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA…
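The contrast the abstract draws between stationary convolution kernels and dynamic feature transforms can be illustrated with a small sketch. This is not the RSA method and does not reproduce its relational kernels; it only compares a fixed temporal kernel with a plain self-attention-style kernel that is recomputed from the content at every position. All names here (`static_transform`, `dynamic_kernels`, the toy `frames`) are hypothetical, learned query/key/value projections are omitted, and clamping the window at clip boundaries is an assumption made for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def neighborhood(frames, t, radius):
    """Frame features in the temporal window around t (indices clamped at the clip ends)."""
    last = len(frames) - 1
    return [frames[min(max(t + off, 0), last)] for off in range(-radius, radius + 1)]

def aggregate(weights, context):
    """Weighted sum of the context vectors."""
    dim = len(context[0])
    return [sum(w * vec[c] for w, vec in zip(weights, context)) for c in range(dim)]

def static_transform(frames, kernel):
    """Stationary kernel: the same weights are applied at every time step."""
    radius = len(kernel) // 2
    return [aggregate(kernel, neighborhood(frames, t, radius))
            for t in range(len(frames))]

def dynamic_kernels(frames, radius):
    """Dynamic kernel: weights are recomputed per time step from query-key similarity."""
    kernels = []
    for t, query in enumerate(frames):
        context = neighborhood(frames, t, radius)
        kernels.append(softmax([dot(query, key) for key in context]))
    return kernels

def dynamic_transform(frames, radius):
    """Aggregate each window with its own content-dependent kernel."""
    kernels = dynamic_kernels(frames, radius)
    return [aggregate(k, neighborhood(frames, t, radius))
            for t, k in enumerate(kernels)]

# Toy clip: four frames, each a 2-dimensional feature vector (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [3.0, 0.0]]
static_out = static_transform(frames, [0.25, 0.5, 0.25])
kernels = dynamic_kernels(frames, radius=1)
dynamic_out = dynamic_transform(frames, radius=1)
```

In this sketch the static kernel `[0.25, 0.5, 0.25]` is identical at every time step, while each entry of `kernels` differs because it depends on the frame content inside its window.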
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Dense Connections · Residual Connection · Layer Normalization · Absolute Position Encodings · Convolution · Softmax
