Is Space-Time Attention All You Need for Video Understanding?

Gedas Bertasius; Heng Wang; Lorenzo Torresani

arXiv:2102.05095 · cs.CV · June 10, 2021 · 1.4k cites

TL;DR

TimeSformer is a purely attention-based, convolution-free model for video classification. Using spatiotemporal self-attention, it reaches state-of-the-art accuracy on Kinetics-400 and Kinetics-600, trains faster than 3D convolutional networks, and scales to longer videos.

Contribution

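The paper proposes a convolution-free Transformer architecture for video classification and compares several self-attention schemes. Among the designs considered, "divided attention," which applies temporal and spatial attention separately within each block, gives the best classification accuracy.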

Findings

01 Achieves state-of-the-art accuracy on the Kinetics-400 and Kinetics-600 datasets.

02 Trains faster and achieves higher test efficiency than 3D CNNs.

03 Remains effective on video clips longer than one minute.

Abstract

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency…
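The "divided attention" scheme described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head, random projection weights, and no classification token, layer normalization, or MLP. The names `self_attention`, `divided_space_time_block`, and `params`, and the sizes `T`, `N`, `D`, are hypothetical and chosen only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention over the
    # second-to-last axis of x, which has shape (..., L, D).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def divided_space_time_block(x, params):
    # x: (T, N, D) patch embeddings for T frames with N patches each.
    # Temporal attention: each patch location attends across the T frames.
    xt = np.swapaxes(x, 0, 1)                      # (N, T, D)
    xt = xt + self_attention(xt, *params["time"])  # residual connection
    x = np.swapaxes(xt, 0, 1)                      # back to (T, N, D)
    # Spatial attention: each frame attends across its own N patches.
    x = x + self_attention(x, *params["space"])    # residual connection
    return x

# Illustrative sizes (assumptions): 8 frames, 196 patches, 64 channels.
rng = np.random.default_rng(0)
T, N, D = 8, 196, 64
params = {
    name: [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3)]
    for name in ("time", "space")
}
x = rng.normal(size=(T, N, D))
out = divided_space_time_block(x, params)
print(out.shape)  # (8, 196, 64)
```

With these illustrative sizes and no classification token, joint attention over all T·N tokens would score (T·N)² = 2,458,624 query-key pairs, whereas the divided scheme scores T·N·(T + N) = 319,872. That gap is one plausible reading of why separating the two attention steps helps with longer clips.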

Code & Models

Videos

Is Space-Time Attention All You Need for Video Understanding? · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Hand Gesture Recognition Systems

Methods: TimeSformer