Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius, Heng Wang, Lorenzo Torresani

TL;DR
TimeSformer introduces a purely attention-based, convolution-free model for video classification that achieves state-of-the-art accuracy, faster training, and better scalability to longer videos by leveraging spatiotemporal self-attention.
Contribution
The paper proposes a novel Transformer-based architecture for video understanding that outperforms existing methods and explores different self-attention schemes, notably divided attention.
Findings
Achieves state-of-the-art accuracy on Kinetics-400 and Kinetics-600 datasets.
Faster training and higher test efficiency compared to 3D CNNs.
Applicable to long video clips (over one minute long).
Abstract
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency…
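The "divided attention" scheme described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, random projection weights, and no classification token, layer normalization, or MLP, and the names `self_attention`, `divided_attention`, `time_w`, and `space_w` are hypothetical. The input is a tensor of shape (F, N, D) for F frames, N patches per frame, and embedding size D. Temporal attention runs first, over patches at the same spatial location across frames, followed by spatial attention over patches within the same frame.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention over the
    # second-to-last axis of x, which has shape (..., L, D).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def divided_attention(x, time_w, space_w):
    # x has shape (F, N, D): frames, patches per frame, embedding size.
    # Temporal step: view as (N, F, D) so each spatial location
    # attends only across the F frames. Residual connection added.
    xt = np.swapaxes(x, 0, 1)
    xt = xt + self_attention(xt, *time_w)
    x = np.swapaxes(xt, 0, 1)
    # Spatial step: each frame attends only across its own N patches.
    return x + self_attention(x, *space_w)


def comparisons_per_patch(num_frames, num_patches):
    # Keys each query is compared with, ignoring the classification
    # token (a simplification made for this sketch).
    joint = num_frames * num_patches
    divided = num_frames + num_patches
    return joint, divided


rng = np.random.default_rng(0)
F, N, D = 8, 196, 64  # illustrative sizes, e.g. 14 x 14 patches per frame
x = rng.standard_normal((F, N, D))
time_w = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)]
space_w = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)]

out = divided_attention(x, time_w, space_w)
print(out.shape)                    # (8, 196, 64)
print(comparisons_per_patch(F, N))  # (1568, 204)
```

Under these assumptions, each patch is compared with F + N other patches per block instead of the F × N required by joint space-time attention, which is one way to see why a divided scheme can scale to longer clips.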
Code & Models
- 🤗 facebook/timesformer-base-finetuned-k400 · model · 22k dl · ♡ 43
- 🤗 facebook/timesformer-base-finetuned-ssv2 · model · 7.3k dl · ♡ 3
- 🤗 facebook/timesformer-base-finetuned-k600 · model · 19k dl · ♡ 12
- 🤗 facebook/timesformer-hr-finetuned-k400 · model · 121 dl · ♡ 3
- 🤗 facebook/timesformer-hr-finetuned-ssv2 · model · 102 dl · ♡ 2
- 🤗 facebook/timesformer-hr-finetuned-k600 · model · 80 dl · ♡ 6
- 🤗 fcakyon/timesformer-hr-finetuned-k400 · model · 3 dl
- 🤗 fcakyon/timesformer-hr-finetuned-k600 · model · 3 dl
- 🤗 fcakyon/timesformer-hr-finetuned-ssv2 · model · 4 dl
- 🤗 fcakyon/timesformer-large-finetuned-ssv2 · model · 14 dl
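As a rough sketch of how one of the checkpoints listed above might be used, the snippet below relies on the Hugging Face `transformers` classes `AutoImageProcessor` and `TimesformerForVideoClassification`. The helper names `sample_frame_indices` and `classify` are hypothetical, and the default of 8 input frames is an assumption about the base Kinetics-400 checkpoint that should be checked against its model card. The heavy imports are deferred into `classify` so that the frame-sampling helper can be used without `torch` or `transformers` installed.

```python
def sample_frame_indices(total_frames, num_frames=8):
    # Pick num_frames evenly spaced frame indices from a clip,
    # taking the middle frame of each equal-length segment.
    if total_frames < num_frames:
        raise ValueError("clip is shorter than the requested number of frames")
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]


def classify(frames, checkpoint="facebook/timesformer-base-finetuned-k400"):
    # frames: a list of RGB frames (e.g. NumPy arrays), one per sampled index.
    # Imports are deferred so the helper above has no heavy dependencies.
    import torch
    from transformers import AutoImageProcessor, TimesformerForVideoClassification

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = TimesformerForVideoClassification.from_pretrained(checkpoint)
    model.eval()

    # The processor resizes and normalizes the frames into a model input.
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Map the highest-scoring class index to its label name.
    return model.config.id2label[logits.argmax(-1).item()]


print(sample_frame_indices(80))  # [5, 15, 25, 35, 45, 55, 65, 75]
```

Calling `classify(frames)` with frames selected by `sample_frame_indices` downloads the checkpoint on first use and returns a predicted action label. Other checkpoints in the list may expect a different frame count or resolution.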
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Hand Gesture Recognition Systems
Methods: TimeSformer
