Self-Supervised Video Representation Learning with Motion-Contrastive Perception

Jinyu Liu; Ying Cheng; Yuejie Zhang; Rui-Wei Zhao; Rui Feng

arXiv:2204.04607 · cs.CV · April 12, 2022

TL;DR

This paper introduces MCPNet, a self-supervised video representation learning model that emphasizes motion-specific features through a new view, the long-range residual frame, and outperforms existing methods on the UCF-101 and HMDB-51 benchmarks.

Contribution

The paper proposes a new view, the long-range residual frame, and a dual-branch network, MCPNet, that combines motion-specific and semantic learning for self-supervised video representation learning.

Findings

01. Outperforms prior state-of-the-art visual-only self-supervised methods on the UCF-101 and HMDB-51 benchmarks.

02. Captures fine-grained motion features through the MIP branch.

03. Balances motion perception with semantic understanding.

Abstract

Visual-only self-supervised learning has achieved significant improvement in video representation learning. Existing related methods encourage models to learn video representations by utilizing contrastive learning or designing specific pretext tasks. However, some models are likely to focus on the background, which is unimportant for learning video representations. To alleviate this problem, we propose a new view called long-range residual frame to obtain more motion-specific information. Based on this, we propose the Motion-Contrastive Perception Network (MCPNet), which consists of two branches, namely, Motion Information Perception (MIP) and Contrastive Instance Perception (CIP), to learn generic video representations by focusing on the changing areas in videos. Specifically, the MIP branch aims to learn fine-grained motion features, and the CIP branch performs contrastive learning…
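The abstract excerpt does not define the long-range residual frame precisely, so the following is only a minimal sketch under a stated assumption: that a residual frame is the pixel-wise difference between two frames, and that "long-range" means the two frames are separated by a temporal gap larger than one. The function name `long_range_residuals`, the `gap` parameter, and the flattened-frame representation are hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Sequence

# Assumption: a frame is a flattened grayscale image, one float per pixel.
Frame = List[float]


def long_range_residuals(clip: Sequence[Frame], gap: int) -> List[Frame]:
    """Return pixel-wise differences between frames `gap` steps apart.

    Hypothetical helper: gap=1 gives ordinary adjacent-frame residuals,
    while a larger gap gives the assumed "long-range" variant.
    """
    if gap < 1 or gap >= len(clip):
        raise ValueError("gap must satisfy 1 <= gap < len(clip)")
    return [
        [later - earlier for earlier, later in zip(clip[t], clip[t + gap])]
        for t in range(len(clip) - gap)
    ]


# Toy clip: 4 frames of 3 pixels each.
# Pixels 0 and 2 are static background; pixel 1 changes slowly over time.
clip = [
    [5.0, 0.0, 2.0],
    [5.0, 1.0, 2.0],
    [5.0, 2.0, 2.0],
    [5.0, 3.0, 2.0],
]

adjacent = long_range_residuals(clip, gap=1)    # three residual frames
long_range = long_range_residuals(clip, gap=3)  # one residual frame
```

In this toy clip the static background pixels cancel to zero in every residual, while the slowly changing pixel produces a larger value at `gap=3` than at `gap=1`. That is consistent with the abstract's motivation of steering the model away from the background and toward the changing areas of the video.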

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Multimodal Machine Learning Applications

Methods: Contrastive Learning