Self-supervised Video Representation Learning by Context and Motion Decoupling
Lianghua Huang, Yu Liu, Bin Wang, Pan Pan, Yinghui Xu, Rong Jin

TL;DR
This paper introduces a self-supervised video representation learning method that explicitly decouples motion and context information using compressed video data, leading to significant improvements in video retrieval and action recognition.
Contribution
The authors propose a novel pretext task framework that extracts motion and context supervision from compressed videos, enhancing representation quality over prior implicit methods.
Findings
Improves video retrieval recall by 16.0% on UCF101 and 11.1% on HMDB51.
Motion prediction as an auxiliary task boosts action recognition accuracy by up to 13.8%.
Extracts supervision signals (keyframes and motion vectors) efficiently, at over 500 fps on the CPU.
Abstract
A key challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias. While most existing works implicitly achieve this with video-specific pretext tasks (e.g., predicting clip orders, time arrows, and paces), we develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task. Specifically, we take the keyframes and motion vectors in compressed videos (e.g., in H.264 format) as the supervision sources for context and motion, respectively, which can be efficiently extracted at over 500 fps on the CPU. Then we design two pretext tasks that are jointly optimized: a context matching task where a pairwise contrastive loss is cast between video clip and keyframe features; and a motion prediction task where clip features, passed through an encoder-decoder network, are…
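The context matching task described above can be illustrated with a small sketch. The abstract only states that a pairwise contrastive loss is cast between video clip and keyframe features. The InfoNCE-style formulation, the function name `context_matching_loss`, the temperature value, and the use of in-batch negatives below are all assumptions made for illustration, not the authors' confirmed implementation.

```python
import numpy as np


def context_matching_loss(clip_feats, key_feats, temperature=0.07):
    """Illustrative InfoNCE-style loss between clip and keyframe features.

    Assumptions (not taken from the paper): row i of clip_feats and row i of
    key_feats come from the same video, other rows in the batch act as
    negatives, and temperature=0.07 is an arbitrary illustrative value.
    """
    # L2-normalize so that dot products become cosine similarities.
    clip = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    key = key_feats / np.linalg.norm(key_feats, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (N, N), scaled by the temperature.
    logits = clip @ key.T / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive pair for clip i is keyframe i (the diagonal entries).
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))  # stand-in for encoder outputs

aligned = context_matching_loss(feats, feats)
misaligned = context_matching_loss(feats, np.roll(feats, 1, axis=0))
print(aligned < misaligned)
```

In this toy setup the loss is lower when each clip is paired with its own keyframe than when the pairing is shifted by one position, which is the behaviour a context matching objective is meant to reward.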
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging
