Spatiotemporal Contrastive Video Representation Learning
Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, Yin Cui

TL;DR
This paper introduces a self-supervised contrastive learning method for video representations, emphasizing the importance of spatial and temporal augmentations, achieving state-of-the-art results on Kinetics-600.
Contribution
It proposes novel spatial and temporal augmentation techniques for video contrastive learning and demonstrates significant performance improvements over prior methods.
Findings
Achieves 70.4% top-1 accuracy on Kinetics-600 under linear evaluation with an R3D-50 backbone.
Outperforms ImageNet supervised pre-training by 15.7% with the same R3D-50 backbone.
Improves further to 72.9% top-1 accuracy with a larger R3D-152 backbone.
Abstract
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. We also propose a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are…
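The two ideas the abstract describes, a spatial augmentation shared by every frame of a clip and a contrastive loss that pulls together clips from the same video, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `temporally_consistent_crop` and `info_nce_loss` are hypothetical, a random crop stands in for the full augmentation pipeline, the temperature of 0.1 is an assumed value, and negatives are taken only from the other view for brevity.

```python
import numpy as np


def temporally_consistent_crop(clip, crop_size, rng):
    """Crop every frame of `clip` (T, H, W, C) with the same random window.

    The crop offsets are sampled once per clip rather than once per frame,
    so the spatial augmentation stays consistent along the time axis.
    """
    _, height, width, _ = clip.shape
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    return clip[:, top:top + crop_size, left:left + crop_size, :]


def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE-style) loss over a batch of clip embeddings.

    z1[i] and z2[i] are embeddings of two augmented clips from video i
    (the positive pair); z2[j] with j != i serve as negatives for z1[i].
    Simplification (assumption): only cross-view negatives are used.
    """
    # L2-normalise so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature               # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; average their negative log-likelihood.
    return -np.mean(np.diag(log_prob))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((8, 32, 32, 3))              # 8 frames of 32x32 RGB
    view = temporally_consistent_crop(clip, 16, rng)
    print(view.shape)
    embeddings = rng.normal(size=(4, 8))           # stand-in for encoder output
    print(info_nce_loss(embeddings, embeddings))
```

Sampling the crop window once per clip is the point of the first function: sampling it independently per frame would introduce artificial jitter between frames. The loss is lowest when each clip's embedding is most similar to the other clip from the same video.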
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: 3D Convolution · Temporally Consistent Spatial Augmentation · Contrastive Video Representation Learning · Average Pooling · Convolution · Dense Connections · 1x1 Convolution · Global Average Pooling · Random Gaussian Blur · Batch Normalization
