Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity

Hanwen Liang; Niamul Quader; Zhixiang Chi; Lizhe Chen; Peng Dai; Juwei Lu; Yang Wang

arXiv:2112.05883·cs.CV·January 13, 2022

TL;DR

This paper introduces a novel self-supervised learning method called CPNet that leverages video continuity to improve video representation learning, outperforming previous methods on various downstream tasks.

Contribution

It proposes three new continuity-based pretext tasks (continuity justification, discontinuity localization, and missing section approximation) and demonstrates that they improve video representations over existing approaches.

Findings

01. Outperforms prior methods on action recognition, video retrieval, and localization.

02. Combining continuity tasks with other properties improves performance.

03. Learned representations capture local and long-range motion and context.

Abstract

Recent self-supervised video representation learning methods have found significant success by exploring essential properties of videos, e.g., speed and temporal order. This work exploits an essential yet under-explored property of videos, video continuity, to obtain supervision signals for self-supervised representation learning. Specifically, we formulate three novel continuity-related pretext tasks, i.e., continuity justification, discontinuity localization, and missing section approximation, that jointly supervise a shared backbone for video representation learning. This self-supervision approach, termed Continuity Perception Network (CPNet), solves the three tasks together and encourages the backbone network to learn local and long-range motion and context representations. It outperforms prior art on multiple downstream tasks, such as action recognition, video retrieval,…
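
To make the three pretext tasks concrete, here is a minimal sketch of how one training example could be built from a raw video. It is an illustration only, not the authors' implementation: the function name `sample_continuity_example`, the clip and gap lengths, the 50/50 sampling ratio, and the choice to cut a single contiguous section are all assumptions that the abstract does not specify.

```python
import numpy as np

def sample_continuity_example(video, clip_len=16, gap_len=8, rng=None):
    """Build one training example from a video array of shape (T, ...).

    Returns (clip, is_discontinuous, break_index, missing_section).
    All sizes and the sampling scheme are illustrative assumptions.
    """
    rng = rng if rng is not None else np.random.default_rng()
    num_frames = video.shape[0]
    if num_frames < clip_len + gap_len:
        raise ValueError("video too short for the requested clip and gap")

    if rng.random() < 0.5:
        # Continuous clip: clip_len consecutive frames, nothing missing.
        start = int(rng.integers(0, num_frames - clip_len + 1))
        return video[start:start + clip_len], 0, -1, None

    # Discontinuous clip: take clip_len + gap_len frames, then cut gap_len out.
    start = int(rng.integers(0, num_frames - clip_len - gap_len + 1))
    span = video[start:start + clip_len + gap_len]
    # break_index counts the frames kept before the cut (1 .. clip_len - 1).
    break_index = int(rng.integers(1, clip_len))
    missing = span[break_index:break_index + gap_len]
    clip = np.concatenate([span[:break_index], span[break_index + gap_len:]])
    return clip, 1, break_index, missing
```

Under these assumptions, `is_discontinuous` would serve as the target for continuity justification, `break_index` for discontinuity localization, and `missing_section` (or features derived from it) for missing section approximation, with all three heads sharing one backbone that consumes `clip`.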

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning