An Image is Worth 16x16 Words, What is a Video Worth?

Gilad Sharir; Asaf Noy; Lihi Zelnik-Manor

arXiv:2103.13915 · cs.CV · May 28, 2021 · 56 cites

TL;DR

This paper introduces a temporal transformer approach for action recognition in videos that significantly reduces computational requirements while maintaining state-of-the-art accuracy by efficiently capturing salient information across frames.

Contribution

The authors propose a temporal transformer with global attention that needs far fewer frames at inference, achieving high accuracy with faster processing.
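As a rough illustration of what global attention over the temporal axis means, the sketch below applies single-head self-attention across per-frame embeddings, so every sampled frame can attend to every other frame in the video. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `global_temporal_attention`, the random projection matrices, the mean pooling, and the sizes `T = 16` and `D = 32` are all hypothetical choices made for the example.

```python
import numpy as np


def global_temporal_attention(frame_embeddings, w_q, w_k, w_v):
    """Single-head self-attention across the frame (time) axis.

    frame_embeddings: (T, D) array, one embedding per sampled frame.
    w_q, w_k, w_v: (D, D) projection matrices (hypothetical parameters).
    Returns the (T, D) attended embeddings and the (T, T) attention weights.
    """
    q = frame_embeddings @ w_q
    k = frame_embeddings @ w_k
    v = frame_embeddings @ w_v
    # Every frame scores every other frame: attention is global in time.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Illustrative sizes only (assumptions, not taken from the paper).
T, D = 16, 32
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

attended, weights = global_temporal_attention(frames, w_q, w_k, w_v)
# Mean pooling is one simple way to get a single video-level vector.
video_embedding = attended.mean(axis=0)
```

Because the attention matrix is only `T x T`, its cost grows with the number of sampled frames rather than with the full length of the video, which is why a small `T` keeps this step cheap.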

Findings

01. Achieves 80.5% top-1 accuracy on Kinetics-400

02. Uses 30 times fewer frames per video

03. Runs 40 times faster than previous leading methods

Abstract

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach state-of-the-art (SotA) accuracy usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video…
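The contrast the abstract draws between dense multi-clip sampling and covering the whole video with few frames can be sketched with plain index arithmetic. This is an illustrative sketch only: the helper names `clip_frame_indices` and `sparse_frame_indices`, the segment-centre sampling rule, and every number used (a 300-frame video, 10 clips of 8 frames at stride 2, 16 sparse frames) are assumptions chosen for the example, not settings reported in the source.

```python
def clip_frame_indices(num_frames, num_clips, clip_len, stride):
    """Dense sampling: several short clips of closely spaced frames.

    Assumes the video is at least as long as one clip's temporal span.
    """
    span = (clip_len - 1) * stride + 1
    if num_clips == 1:
        starts = [(num_frames - span) // 2]
    else:
        # Spread clip start positions evenly over the video.
        step = (num_frames - span) / (num_clips - 1)
        starts = [round(i * step) for i in range(num_clips)]
    return [[s + j * stride for j in range(clip_len)] for s in starts]


def sparse_frame_indices(num_frames, num_samples):
    """Sparse sampling: one frame from the centre of each equal segment."""
    seg = num_frames / num_samples
    return [int(seg * i + seg / 2) for i in range(num_samples)]


# Hypothetical numbers, for illustration only.
num_frames = 300
clips = clip_frame_indices(num_frames, num_clips=10, clip_len=8, stride=2)
sparse = sparse_frame_indices(num_frames, num_samples=16)

dense_total = sum(len(c) for c in clips)  # frames processed across all clips
sparse_total = len(sparse)                # frames processed in a single pass
```

With these made-up settings the dense scheme processes 80 frames over ten forward passes, while the sparse scheme processes 16 frames in one pass and still spans the full video. The actual savings reported for the paper depend on its own configuration, which this page does not give.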

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Neural Network Applications

Methods: 3D Convolution · Convolution