Motion Sensitive Contrastive Learning for Self-supervised Video Representation
Jingcheng Ni, Nan Zhou, Jie Qin, Qian Wu, Junqi Liu, Boxun Li, Di Huang

TL;DR
This paper introduces Motion Sensitive Contrastive Learning (MSCL), a novel approach that enhances self-supervised video representation by integrating motion information from optical flows with RGB frames, improving performance on multiple benchmarks.
Contribution
The paper proposes MSCL with local motion contrastive learning, flow rotation augmentation, and motion differential sampling, advancing contrastive learning for video understanding by explicitly modeling motion dynamics.
Findings
Achieves 91.5% top-1 accuracy on UCF101
Attains 50.3% top-1 accuracy on Something-Something v2
Reaches 65.6% top-1 recall on UCF101 for retrieval
Abstract
Contrastive learning has shown great potential in video representation learning. However, existing approaches fail to sufficiently exploit short-term motion dynamics, which are crucial to various down-stream video understanding tasks. In this paper, we propose Motion Sensitive Contrastive Learning (MSCL) that injects the motion information captured by optical flows into RGB frames to strengthen feature learning. To achieve this, in addition to clip-level global contrastive learning, we develop Local Motion Contrastive Learning (LMCL) with frame-level contrastive objectives across the two modalities. Moreover, we introduce Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples. Extensive experiments on standard benchmarks validate the effectiveness of the proposed method. With the…
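The frame-level, cross-modal objective described in the abstract can be illustrated with a generic InfoNCE loss. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the temperature value, and the choice to use the other frames of the same clip as negatives are all hypothetical, and the features are assumed to come from separate RGB and optical-flow encoders that are not shown here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def cross_modal_info_nce(rgb_feats, flow_feats, temperature=0.1):
    """Generic InfoNCE loss between two modalities (illustrative only).

    rgb_feats, flow_feats: arrays of shape (T, D). Row t of each array is
    assumed to describe the same frame t. The positive pair for RGB frame t
    is flow frame t; every other flow frame acts as a negative (assumption).
    """
    rgb = l2_normalize(rgb_feats)
    flow = l2_normalize(flow_feats)

    # (T, T) matrix of scaled cosine similarities: rows are RGB, columns flow.
    logits = rgb @ flow.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Matching frames sit on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.normal(size=(8, 16))
    flow = rgb + 0.01 * rng.normal(size=(8, 16))  # nearly aligned toy features
    print("aligned loss:   ", cross_modal_info_nce(rgb, flow))
    print("misaligned loss:", cross_modal_info_nce(rgb, flow[::-1]))
```

In this toy setup the loss is low when each RGB frame is closest to its own flow frame and rises when the frame correspondence is broken, which is the behavior a motion-aware contrastive objective relies on.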
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
