Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation

Yujia Zhang; Lai-Man Po; Xuyuan Xu; Mengyang Liu; Yexin Wang; Weifeng Ou; Yuzhi Zhao; Wing-Yin Yu

arXiv:2112.08913·cs.CV·December 21, 2021

TL;DR

This paper introduces a novel spatio-temporal overlap rate prediction task for self-supervised video representation learning, combining it with contrastive learning to improve understanding of videos.

Contribution

It proposes the STOR pretext task and a joint optimization framework that enhances spatio-temporal video representations beyond existing methods.

Findings

01. STOR task improves contrastive learning effectiveness

02. Joint optimization significantly boosts video understanding performance

03. Method outperforms previous self-supervised approaches

Abstract

Spatio-temporal representation learning is critical for video self-supervised representation. Recent approaches mainly use contrastive learning and pretext tasks. However, these approaches learn representation by discriminating sampled instances via feature similarity in the latent space while ignoring the intermediate state of the learned representations, which limits the overall performance. In this work, taking into account the degree of similarity of sampled instances as the intermediate state, we propose a novel pretext task - spatio-temporal overlap rate (STOR) prediction. It stems from the observation that humans are capable of discriminating the overlap rates of videos in space and time. This task encourages the model to discriminate the STOR of two generated samples to learn the representations. Moreover, we employ a joint optimization combining pretext tasks with contrastive…
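To make the idea of a spatio-temporal overlap rate concrete, here is a minimal sketch of how a target value could be computed for two clips cut from the same video. The exact STOR definition is not given in the text above, so this sketch assumes one plausible choice: the intersection-over-union of the two clips treated as space-time boxes. The names `Clip` and `overlap_rate` are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical space-time box: frames [t0, t1), pixels [x0, x1) x [y0, y1)."""
    t0: int
    t1: int
    x0: int
    y0: int
    x1: int
    y1: int


def _overlap_1d(a0: int, a1: int, b0: int, b1: int) -> int:
    # Length of the intersection of two half-open intervals (0 if disjoint).
    return max(0, min(a1, b1) - max(a0, b0))


def _volume(c: Clip) -> int:
    # Number of space-time cells (frames x width x height) covered by the clip.
    return (c.t1 - c.t0) * (c.x1 - c.x0) * (c.y1 - c.y0)


def overlap_rate(a: Clip, b: Clip) -> float:
    """Assumed overlap rate: intersection volume divided by union volume."""
    inter = (
        _overlap_1d(a.t0, a.t1, b.t0, b.t1)
        * _overlap_1d(a.x0, a.x1, b.x0, b.x1)
        * _overlap_1d(a.y0, a.y1, b.y0, b.y1)
    )
    union = _volume(a) + _volume(b) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 16-frame, 112x112 crops, shifted by 8 frames and 56 pixels in x.
    a = Clip(t0=0, t1=16, x0=0, y0=0, x1=112, y1=112)
    b = Clip(t0=8, t1=24, x0=56, y0=0, x1=168, y1=112)
    print(overlap_rate(a, b))
```

Under this assumed definition, identical clips score 1.0, disjoint clips score 0.0, and the two clips in the example share one seventh of their combined volume. A model trained on such a pretext task would receive the two clips as input and be asked to predict this value, or a discretized bin of it.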

Code & Models

Repositories

katou2/cstp (PyTorch, official)

Videos

Contrastive Spatio-Temporal Pretext Learning for Self-Supervised Video Representation · underline

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Contrastive Learning