Time-Equivariant Contrastive Video Representation Learning

Simon Jenni; Hailin Jin

arXiv:2112.03624·cs.CV·December 8, 2021

TL;DR

This paper presents a self-supervised contrastive learning approach that encodes temporal transformations to learn video representations that preserve dynamics, achieving state-of-the-art results in video retrieval and action recognition.

Contribution

It introduces a novel method that enforces temporal equivariance in video representations, unlike previous invariance-based approaches.

Findings

01

Achieves state-of-the-art results on UCF101, HMDB51, and Diving48.

02

Encodes temporal transformations in the representation, so that it better captures video dynamics.

03

Improves video retrieval and action recognition performance.

Abstract

We introduce a novel self-supervised contrastive learning method to learn representations from unlabelled videos. Existing approaches ignore the specifics of input distortions, e.g., by learning invariance to temporal transformations. Instead, we argue that video representation should preserve video dynamics and reflect temporal manipulations of the input. Therefore, we exploit novel constraints to build representations that are equivariant to temporal transformations and better capture video dynamics. In our method, relative temporal transformations between augmented clips of a video are encoded in a vector and contrasted with other transformation vectors. To support temporal equivariance learning, we additionally propose the self-supervised classification of two clips of a video into 1) overlapping, 2) ordered, or 3) unordered. Our experiments show that time-equivariant representations…
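
The two training signals named in the abstract can be sketched in Python with NumPy. This is a minimal, hypothetical sketch, not the authors' implementation. It assumes that clips are (start, end) frame intervals with an exclusive end, that "ordered" means the first clip ends before the second starts (and "unordered" the reverse), and that "contrasting" transformation vectors is done with a generic InfoNCE loss in which each vector's positive is the vector at the same batch index. The names `clip_pair_label` and `transformation_info_nce` and the temperature of 0.1 are illustrative choices.

```python
import numpy as np

# Hypothetical class indices for the three-way clip-pair task.
OVERLAPPING, ORDERED, UNORDERED = 0, 1, 2


def clip_pair_label(clip_a, clip_b):
    """Label two clips given as (start, end) frame intervals, end exclusive.

    Assumption: ORDERED means clip_a ends before clip_b starts,
    UNORDERED means clip_b ends before clip_a starts.
    """
    a_start, a_end = clip_a
    b_start, b_end = clip_b
    if a_start < b_end and b_start < a_end:
        return OVERLAPPING  # the intervals share at least one frame
    if a_end <= b_start:
        return ORDERED
    return UNORDERED


def transformation_info_nce(queries, keys, temperature=0.1):
    """Generic InfoNCE loss over transformation vectors (an assumed stand-in).

    queries[i] and keys[i] are assumed to encode the same relative temporal
    transformation (the positive pair); every keys[j] with j != i is a negative.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature  # scaled cosine similarities, shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # positives sit on the diagonal


if __name__ == "__main__":
    print(clip_pair_label((0, 16), (8, 24)))   # 0: overlapping
    print(clip_pair_label((0, 16), (32, 48)))  # 1: ordered
    print(clip_pair_label((32, 48), (0, 16)))  # 2: unordered

    vecs = np.eye(4, 8)  # four orthonormal "transformation vectors"
    print(transformation_info_nce(vecs, vecs))                      # near 0
    print(transformation_info_nce(vecs, np.roll(vecs, 1, axis=0)))  # near 10
```

InfoNCE is used here only because it is a common way to contrast embeddings; the summary above does not specify how the paper picks positives and negatives for transformation vectors, so that pairing rule is the main assumption to revisit against the full text.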

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning