Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

Dezhao Luo; Chang Liu; Yu Zhou; Dongbao Yang; Can Ma; Qixiang Ye; Weiping Wang

arXiv:2001.00294·cs.CV·January 3, 2020·24 cites

Open Access · 1 Repo

TL;DR

The paper introduces Video Cloze Procedure (VCP), a self-supervised learning method that improves spatio-temporal video representations by predicting applied operations, leading to state-of-the-art results in action recognition and video retrieval.

Contribution

VCP is a novel self-supervised task that enhances video representation learning by predicting spatio-temporal operations, offering flexibility as a proxy or target task.

Findings

01. Outperforms state-of-the-art self-supervised models on benchmarks
02. Improves action recognition accuracy
03. Enhances video retrieval performance

Abstract

We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatial-temporal representations. VCP first generates "blanks" by withholding video clips and then creates "options" by applying spatio-temporal operations on the withheld clips. Finally, it fills the blanks with "options" and learns representations by predicting the categories of operations applied on the clips. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning. As a target task, it can assess learned representation models in a uniform and interpretable manner. With VCP, we train spatial-temporal representation models (3D-CNNs) and apply such models on action recognition…
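
The cloze construction described in the abstract (withhold a clip, apply an operation to create an "option", fill the blank, and predict the operation category) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not enumerate the operations, so `identity`, `rotate_180`, and `reverse_time` are illustrative stand-ins, and `make_cloze_sample` and `toy_video` are hypothetical helper names. Clips are plain nested lists so the example runs without any external library.

```python
import random

# A clip is a list of frames; a frame is a 2D list of pixel values.
# The three operations below are assumptions chosen for illustration only.

def identity(clip):
    # No-op: return a copy of the clip unchanged.
    return [[row[:] for row in frame] for frame in clip]

def rotate_180(clip):
    # Spatial operation: rotate every frame by 180 degrees.
    return [[row[::-1] for row in frame[::-1]] for frame in clip]

def reverse_time(clip):
    # Temporal operation: play the frames in reverse order.
    return [[row[:] for row in frame] for frame in clip[::-1]]

OPERATIONS = [identity, rotate_180, reverse_time]

def make_cloze_sample(clips, blank_index, rng):
    """Withhold one clip, apply a random operation, and fill the blank.

    Returns the filled sequence and the operation index, which serves as
    the classification label for the self-supervised task.
    """
    label = rng.randrange(len(OPERATIONS))
    option = OPERATIONS[label](clips[blank_index])
    filled = clips[:blank_index] + [option] + clips[blank_index + 1:]
    return filled, label

def toy_video(num_clips=3, frames=2, size=2):
    # Build a tiny video whose pixel values are all distinct, so every
    # operation yields a visibly different clip.
    counter = 0
    video = []
    for _ in range(num_clips):
        clip = []
        for _ in range(frames):
            frame = []
            for _ in range(size):
                frame.append(list(range(counter, counter + size)))
                counter += size
            clip.append(frame)
        video.append(clip)
    return video

video = toy_video()
filled, label = make_cloze_sample(video, blank_index=1, rng=random.Random(0))
```

In a real training setup, the filled sequence would be fed to a 3D-CNN and `label` would be the target of a classification loss; here the sample is only constructed, not learned from.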

Code & Models

Repositories

BestJuly/VCP
pytorch

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning