Controllable Augmentations for Video Representation Learning
Rui Qian, Weiyao Lin, John See, Dian Li

TL;DR
This paper introduces a controllable augmentation framework for self-supervised video representation learning, improving temporal modeling and generalization by combining local and global information with contrastive learning.
Contribution
It proposes a novel controllable augmentation approach that jointly leverages local clips and global videos to enhance temporal structure learning in videos.
Findings
Outperforms existing methods on three video benchmarks.
Achieves more accurate temporal dynamics modeling.
Enhances generalization by mutual information minimization.
Abstract
This paper focuses on self-supervised video representation learning. Most existing approaches follow the contrastive learning pipeline to construct positive and negative pairs by sampling different clips. However, this formulation tends to be biased toward static backgrounds and has difficulty establishing global temporal structures. The major reason is that the positive pairs, i.e., different clips sampled from the same video, have a limited temporal receptive field, and usually share a similar background but differ in motion. To address these problems, we propose a framework to jointly utilize local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations. Based on a set of controllable augmentations, we achieve accurate appearance and motion pattern alignment through soft spatio-temporal region contrast. Our formulation is able to…
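For readers unfamiliar with the clip-level contrastive pipeline that the abstract criticizes, the following is a minimal sketch of a generic InfoNCE loss, not the authors' method. It assumes that row i of `z_a` and row i of `z_b` are embeddings of two clips sampled from the same video, so every other row in the batch acts as a negative. The function name `info_nce`, the temperature value, and the embedding shapes are illustrative assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Generic InfoNCE loss (illustrative sketch, not the paper's objective).

    z_a, z_b: arrays of shape (N, D). Row i of each is assumed to be an
    embedding of a different clip from the same video (a positive pair);
    all other rows in the batch serve as negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature: shape (N, N).
    logits = z_a @ z_b.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positives sit on the diagonal; average their negative log-probability.
    return float(-np.mean(np.diag(log_prob)))
```

Because the positives in this baseline are short clips from one video, the loss can be lowered by matching shared background alone, which is the bias the abstract describes.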
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging
Methods: Contrastive Learning
