Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition
Unaiza Ahsan, Rishi Madhok, Irfan Essa

TL;DR
This paper introduces a self-supervised learning method that combines spatial and temporal context in videos by solving jigsaw puzzles across multiple frames, enabling effective action recognition without heavy preprocessing.
Contribution
It presents a novel joint spatial-temporal self-supervised framework using a new permutation strategy for video understanding tasks.
Findings
Achieves strong performance on benchmark datasets.
No labeled data needed for pretraining.
Outperforms previous methods in unsupervised video recognition.
Abstract
We propose a self-supervised learning method to jointly reason about spatial and temporal context for video recognition. Recent self-supervised approaches have used spatial context [9, 34] as well as temporal coherency [32] but a combination of the two requires extensive preprocessing such as tracking objects through millions of video frames [59] or computing optical flow to determine frame regions with high motion [30]. We propose to combine spatial and temporal context in one self-supervised framework without any heavy preprocessing. We divide multiple video frames into grids of patches and train a network to solve jigsaw puzzles on these patches from multiple frames. So the network is trained to correctly identify the position of a patch within a video frame as well as the position of a patch over time. We also propose a novel permutation strategy that outperforms random permutations…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
MethodsJigsaw
