Dual Contrastive Learning for Spatio-temporal Representation
Shuangrui Ding, Rui Qian, Hongkai Xiong

TL;DR
This paper introduces DCLR, a dual contrastive learning approach that decouples static scene and dynamic motion features to improve self-supervised spatio-temporal video representation learning.
Contribution
The paper proposes a novel dual contrastive formulation that decouples static and dynamic features, addressing background bias in video contrastive learning.
Findings
Achieves state-of-the-art performance on UCF-101, HMDB-51, and Diving-48 datasets.
Effectively encodes static and dynamic features into RGB representations.
Demonstrates that the learned representations discriminate motion patterns better, rather than relying on background scenes.
Abstract
Contrastive learning has shown promising potential in self-supervised spatio-temporal representation learning. Most works naively sample different clips to construct positive and negative pairs. However, we observe that this formulation inclines the model towards the background scene bias. The underlying reasons are twofold. First, the scene difference is usually more noticeable and easier to discriminate than the motion difference. Second, the clips sampled from the same video often share similar backgrounds but have distinct motions. Simply regarding them as positive pairs will draw the model to the static background rather than the motion pattern. To tackle this challenge, this paper presents a novel dual contrastive formulation. Concretely, we decouple the input RGB video sequence into two complementary modes, static scene and dynamic motion. Then, the original RGB features are…
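The decoupling idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it describes how the RGB features are used, so everything below is an assumption made for illustration. In particular, the temporal mean as a stand-in for the static scene, frame differences as a stand-in for dynamic motion, a standard InfoNCE loss, and the names `decouple`, `info_nce`, and `dual_contrastive_loss` are all hypothetical choices.

```python
import math

def decouple(clip):
    """Split a clip into a static proxy and a dynamic proxy.

    clip: list of T frames, each a flat list of pixel values.
    ASSUMPTION: temporal mean ~ static scene, frame differences ~ motion.
    The paper's actual decoupling may differ.
    """
    t = len(clip)
    static = [sum(px) / t for px in zip(*clip)]
    dynamic = [[b - a for a, b in zip(f0, f1)]
               for f0, f1 in zip(clip, clip[1:])]
    return static, dynamic

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def info_nce(query, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for one query: -log softmax of the positive logit."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

def dual_contrastive_loss(q_static, k_static, neg_static,
                          q_dynamic, k_dynamic, neg_dynamic,
                          temperature=0.1):
    """ASSUMPTION: the dual objective is the sum of one contrastive term on
    static-aware features and one on dynamic-aware features."""
    return (info_nce(q_static, k_static, neg_static, temperature)
            + info_nce(q_dynamic, k_dynamic, neg_dynamic, temperature))

# Toy usage: a 2-frame clip with 2 pixels; only the first pixel changes.
static, dynamic = decouple([[1.0, 2.0], [3.0, 2.0]])
loss = dual_contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]],
                             [0.0, 1.0], [0.0, 1.0], [[1.0, 0.0]])
```

In the toy clip, the unchanged second pixel yields a zero entry in the dynamic proxy, which shows why a motion-only view carries no background cue. In a real pipeline the feature vectors passed to `dual_contrastive_loss` would come from a video encoder rather than being hand-written.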
