An Image is Worth 16x16 Words, What is a Video Worth?
Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor

TL;DR
This paper introduces a temporal transformer approach for action recognition in videos that significantly reduces computational requirements while maintaining state-of-the-art accuracy by efficiently capturing salient information across frames.
Contribution
The authors propose a global attention-based temporal transformer that reduces the number of frames needed for inference, achieving high accuracy from fewer frames per video and with faster processing.
Findings
Achieves 80.5% top-1 accuracy on Kinetics-400
Uses 30 times fewer frames per video
Runs 40 times faster than previous leading methods
Abstract
Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video…
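The idea in the abstract, a few frames spread over the whole video with attention applied globally across them, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention, the random stand-in frame features, the mean pooling, and the sizes (300 video frames, 16 sampled frames, 8-dimensional embeddings) are all assumptions chosen for illustration.

```python
import numpy as np

def sample_frame_indices(num_video_frames, num_samples):
    """Pick num_samples frame indices spread uniformly over the whole video
    (hypothetical helper; uses the center of each equal-length segment)."""
    edges = np.linspace(0, num_video_frames, num_samples + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return np.minimum(centers.astype(int), num_video_frames - 1)

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_temporal_attention(frame_embeddings, w_q, w_k, w_v):
    """Single-head self-attention in which every sampled frame attends to
    every other sampled frame (an assumed, simplified form of global attention)."""
    q = frame_embeddings @ w_q
    k = frame_embeddings @ w_k
    v = frame_embeddings @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (frames, frames)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
num_frames, dim = 16, 8                        # assumed sizes
indices = sample_frame_indices(300, num_frames)

# Stand-in for per-frame features; a real system would get these
# from a 2D image backbone applied to each sampled frame.
frames = rng.normal(size=(num_frames, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

attended, weights = global_temporal_attention(frames, w_q, w_k, w_v)
video_embedding = attended.mean(axis=0)        # assumed pooling choice
```

Because the sampled frames already span the full video, a single forward pass covers its whole temporal extent, in contrast to the multi-clip inference the abstract describes for 3D-convolution methods.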
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: 3D Convolution · Convolution
