TL;DR
CoCon introduces a cooperative contrastive learning approach for video representation that leverages multiple views and inter-instance relationships, improving action recognition performance and capturing higher-order class relationships.
Contribution
It proposes a novel cooperative contrastive learning framework that utilizes complementary multi-view data and inter-instance relationships for better video representations.
Findings
Achieves competitive results on UCF101, HMDB51, and Kinetics400
Effectively captures higher-order class relationships
Utilizes implicit relationships between video views
Abstract
Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning may unknowingly lead to the separation of instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning to utilize complementary information across views and address this issue. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g. RGB) or inferred (e.g. flow, segmentation masks, poses). We are among the first to explore exploiting inter-instance relationships to drive learning. We experimentally evaluate our representations on…
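To make the concern in the abstract concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss in plain Python. It is not the paper's cooperative loss; the function names, the temperature value, and the toy embeddings are all assumptions chosen for illustration. It shows that every other instance in the batch is treated as a negative, so two clips of the same event would be pushed apart unless extra information identifies them as related.

```python
import math

def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the CoCon objective).

    anchors[i] and positives[i] are embeddings of two views of instance i.
    All positives[j] with j != i act as negatives for anchors[i].
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / len(a)

# Toy embeddings (hypothetical): three instances with orthogonal features.
views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = info_nce_loss(views, views)                   # positives match anchors
shifted = info_nce_loss(views, views[1:] + views[:1])   # positives mismatched
```

In this toy setup the aligned pairing yields a loss near zero, while the mismatched pairing yields a much larger one. Nothing in the loss distinguishes a negative that merely belongs to a different instance from one that depicts a different event, which is the gap the abstract says the cooperative variant targets.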
Taxonomy
Methods: Contrastive Learning
