Video Representation Learning by Recognizing Temporal Transformations
Simon Jenni, Givi Meishvili, Paolo Favaro

TL;DR
This paper presents a self-supervised video representation learning method that uses temporal transformations to improve motion understanding, boosting action recognition performance without human annotations.
Contribution
It introduces a novel self-supervised approach based on discriminating videos from their temporally transformed versions to learn motion-sensitive representations.
Findings
Improved action recognition accuracy on UCF101 and HMDB51 datasets.
Effective learning of motion features without manual annotations.
Transformations based on time warps enhance the discriminative power of learned representations.
Abstract
We introduce a novel self-supervised learning approach to learn representations of videos that are responsive to changes in the motion dynamics. Our representations can be learned from data without human annotation and provide a substantial boost to the training of neural networks on small labeled data sets for tasks such as action recognition, which require accurately distinguishing the motion of objects. We promote accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions. To learn to distinguish non-trivial motions, the design of the transformations is based on two principles: 1) to define clusters of motions based on time warps of different magnitude; 2) to ensure that the discrimination is feasible only by observing and analyzing as many image frames as possible. Thus, we introduce…
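The pretext task described in the abstract can be illustrated with a small sketch. It assumes the simplest kind of time warp, uniform frame skipping at a few rates, and labels each clip with the index of its rate so a classifier could be trained to tell them apart. The function names, the skip values (1, 2, 4), and the clip length are hypothetical choices made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import numpy as np

def speed_indices(num_frames, clip_len, skip, start=0):
    """Frame indices of a clip that keeps every `skip`-th frame (a uniform time warp)."""
    idx = start + skip * np.arange(clip_len)
    if idx[-1] >= num_frames:
        raise ValueError("video too short for this clip length and skip")
    return idx

def make_pretext_examples(video, clip_len=4, skips=(1, 2, 4), rng=None):
    """Build one clip per skip rate; the label is the index of the skip rate.

    `video` is an array of shape (frames, height, width, channels).
    The skip rates are an assumed stand-in for "time warps of different magnitude".
    """
    rng = np.random.default_rng() if rng is None else rng
    clips, labels = [], []
    for label, skip in enumerate(skips):
        span = skip * (clip_len - 1) + 1  # source frames covered by the clip
        start = int(rng.integers(0, video.shape[0] - span + 1))
        clips.append(video[speed_indices(video.shape[0], clip_len, skip, start)])
        labels.append(label)
    return np.stack(clips), np.array(labels)

# Toy video whose pixel value equals the frame number, so the warp is easy to inspect.
video = np.arange(32, dtype=float).reshape(32, 1, 1, 1)
clips, labels = make_pretext_examples(video, rng=np.random.default_rng(0))
```

In this toy setup every clip has the same number of frames, so the only cue that separates the classes is how far the content moves between consecutive frames. That is the sense in which such a task would push a network toward motion-sensitive features.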
