Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting
Martine Toering, Ioannis Gatopoulos, Maarten Stol, Vincent Tao Hu

TL;DR
This paper introduces a novel self-supervised learning method for videos that leverages cross-stream prototypical contrast to improve embedding quality, capturing motion without needing optical flow during inference.
Contribution
It proposes a new cross-stream prototypical contrastive approach that predicts consistent prototype assignments from RGB and optical flow views, enhancing video representation learning.
Findings
Achieved 90.5% Top-1 accuracy on UCF101 with an S3D backbone.
Outperformed the previous best by +3.2% on UCF101 (S3D backbone) and by +15.1% on HMDB51 (R(2+1)D backbone).
Learned more efficient video embeddings with embedded motion information.
Abstract
Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. However, they are not suitable for exploiting the rich dynamical structure of video, as operations are done on many augmented instances. In this paper we propose "Video Cross-Stream Prototypical Contrasting", a novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples. Specifically, we alternate the optimization process; while optimizing one of the streams, all views are mapped to one set of stream prototype vectors. Each of the assignments is predicted with all views except the one matching the prediction, pushing representations closer to their assigned prototypes. As a result, more efficient video embeddings with ingrained motion…
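The prediction step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that embeddings of several RGB and optical flow views of the same samples are already available, that soft prototype assignments are balanced with a Sinkhorn-style normalization (a choice borrowed from other prototype-based methods, not stated in the text above), and that each view's assignment is predicted from every other view with a cross-entropy loss. All function names, the temperature, and the iteration count are hypothetical. The sketch covers one loss evaluation against one stream's prototype set; the alternation between streams and the gradient updates are left out.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def sinkhorn(scores, eps=0.05, n_iters=3):
    # Assumption: balanced soft assignments via a few Sinkhorn-style
    # normalization steps. Input is (batch, prototypes); each output row sums to 1.
    q = np.exp((scores - scores.max()) / eps).T  # (prototypes, batch)
    q /= q.sum()
    k, b = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # equalize mass per prototype
        q /= k
        q /= q.sum(axis=0, keepdims=True)  # equalize mass per sample
        q /= b
    return (q * b).T

def cross_view_prototype_loss(views, prototypes, temperature=0.1):
    # views: list of (batch, dim) embeddings, e.g. RGB and flow augmentations.
    # prototypes: (n_prototypes, dim), the prototype set of the stream being optimized.
    protos = l2_normalize(prototypes)
    scores = [l2_normalize(z) @ protos.T for z in views]
    targets = [sinkhorn(s) for s in scores]
    log_probs = [log_softmax(s / temperature) for s in scores]
    total, n_terms = 0.0, 0
    for s, q in enumerate(targets):
        for t, log_p in enumerate(log_probs):
            if t == s:
                continue  # the view that produced the assignment does not predict it
            total += -(q * log_p).sum(axis=1).mean()
            n_terms += 1
    return total / n_terms

# Toy usage with random embeddings: two "RGB" and two "flow" views of 8 samples.
rng = np.random.default_rng(0)
toy_views = [rng.normal(size=(8, 16)) for _ in range(4)]
toy_prototypes = rng.normal(size=(5, 16))
print(cross_view_prototype_loss(toy_views, toy_prototypes))
```

In a training loop, the targets would be treated as constants and the loss minimized with respect to the encoder and prototype parameters, so that the views of a sample move toward the same prototype.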
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning · 3D Convolution · Residual Connection · Average Pooling · Global Average Pooling · Dense Connections · (2+1)D Convolution · Batch Normalization · R(2+1)D
