Self-supervised Co-training for Video Representation Learning
Tengda Han, Weidi Xie, Andrew Zisserman

TL;DR
This paper introduces a self-supervised co-training method for video representation learning in which two views of the same video, RGB and optical flow, supply positive samples for each other's contrastive training. It reports state-of-the-art or comparable results on action recognition and video retrieval while needing less training data.
Contribution
It proposes a novel co-training scheme that improves InfoNCE-based contrastive learning for video by exploiting complementary views of the same data, RGB streams and optical flow, using one view to obtain positive class samples for the other.
Findings
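Achieves state-of-the-art or comparable performance on two downstream tasks: action recognition and video retrieval.
Requires less training data to reach similar performance.
Adding semantic-class positives to instance-based InfoNCE training clearly improves performance, and multi-view co-training provides such positives without labels.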
Abstract
The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular InfoNCE loss, exploiting the complementary information from different views, RGB streams and optical flow, of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other…
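The idea in contribution (ii) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`mine_positives`, `multi_positive_nce`), the top-k nearest-neighbour rule, the value of `k`, and the temperature `0.07` are all assumptions chosen for illustration. It shows one direction only: features from the flow view pick extra positives, and an InfoNCE-style loss with multiple positives is then computed on RGB features.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def mine_positives(other_view_feats, k):
    """Boolean [N, N] mask of positives chosen in the *other* view.

    Entry (i, j) is True when j is sample i itself or one of its k nearest
    neighbours by cosine similarity. The top-k rule is an assumption here.
    """
    z = l2_normalize(other_view_feats)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)            # exclude self from the neighbour search
    topk = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar samples
    mask = np.eye(len(z), dtype=bool)         # the instance itself is always a positive
    rows = np.arange(len(z))[:, None]
    mask[rows, topk] = True
    return mask

def multi_positive_nce(query, keys, pos_mask, temperature=0.07):
    """InfoNCE-style loss where each query may have several positives.

    loss_i = -log( sum_{p in positives} exp(q_i . k_p / t)
                   / sum_{j in all keys} exp(q_i . k_j / t) )
    With pos_mask equal to the identity this is plain instance-level InfoNCE.
    """
    q, k = l2_normalize(query), l2_normalize(keys)
    logits = q @ k.T / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    pos = (exp * pos_mask).sum(axis=1)
    total = exp.sum(axis=1)
    return float(np.mean(-np.log(pos / total)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb_q = rng.normal(size=(8, 16))                   # hypothetical RGB features, augmentation 1
    rgb_k = rgb_q + 0.1 * rng.normal(size=(8, 16))     # hypothetical RGB features, augmentation 2
    flow = rng.normal(size=(8, 16))                    # hypothetical optical-flow features
    mask = mine_positives(flow, k=2)                   # flow view proposes positives for RGB
    print(multi_positive_nce(rgb_q, rgb_k, mask))
```

Swapping the roles of the two views (RGB features mining positives for the flow loss) gives the other half of a co-training round under the same assumptions.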
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: INFO: An Efficient Optimization Algorithm based on Weighted Mean of Vectors · Contrastive Learning · InfoNCE
