Self-supervised Video Representation Learning by Context and Motion Decoupling

Lianghua Huang; Yu Liu; Bin Wang; Pan Pan; Yinghui Xu; Rong Jin

arXiv:2104.00862·cs.CV·April 5, 2021·5 cites

TL;DR

This paper introduces a self-supervised video representation learning method that explicitly decouples motion and context information using compressed video data, leading to significant improvements in video retrieval and action recognition.

Contribution

The authors propose a novel pretext task framework that extracts motion and context supervision from compressed videos, enhancing representation quality over prior implicit methods.

Findings

01. Improves video retrieval recall by 16.0% on UCF101 and 11.1% on HMDB51.
02. Motion prediction as an auxiliary task boosts action recognition accuracy by up to 13.8%.
03. Efficient extraction of supervision signals at over 500 fps on CPU.

Abstract

A key challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias. While most existing works implicitly achieve this with video-specific pretext tasks (e.g., predicting clip orders, time arrows, and paces), we develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task. Specifically, we take the keyframes and motion vectors in compressed videos (e.g., in H.264 format) as the supervision sources for context and motion, respectively, which can be efficiently extracted at over 500 fps on the CPU. Then we design two pretext tasks that are jointly optimized: a context matching task where a pairwise contrastive loss is cast between video clip and keyframe features; and a motion prediction task where clip features, passed through an encoder-decoder network, are…
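The context matching task described in the abstract pairs each video clip with its keyframe through a contrastive loss. The sketch below illustrates one common form of such a loss (InfoNCE over cosine similarities) in plain Python. The abstract does not specify the exact formulation, so the InfoNCE form, the `temperature` value, and the names `l2_normalize` and `context_matching_loss` are assumptions for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def context_matching_loss(clip_feats, keyframe_feats, temperature=0.1):
    """InfoNCE-style loss: clip i should match keyframe i and no other keyframe.

    clip_feats and keyframe_feats are equal-length lists of feature vectors.
    The temperature is a hypothetical default, not a value from the paper.
    """
    clips = [l2_normalize(c) for c in clip_feats]
    keys = [l2_normalize(k) for k in keyframe_feats]
    total = 0.0
    for i, clip in enumerate(clips):
        # Cosine similarity of this clip to every keyframe in the batch.
        logits = [sum(a * b for a, b in zip(clip, key)) / temperature
                  for key in keys]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability of the matching keyframe.
        total += log_denom - logits[i]
    return total / len(clips)


clips = [[1.0, 0.0], [0.0, 1.0]]
matched = context_matching_loss(clips, [[1.0, 0.0], [0.0, 1.0]])
swapped = context_matching_loss(clips, [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

In this toy batch, the loss is small when each clip feature aligns with its own keyframe feature and large when the pairing is swapped, which is the behavior a context matching objective of this kind relies on.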

Code & Models

Repositories

alibaba/Deep-Vision/tree/main/Context-Motion-Decoupling
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging