Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

Martine Toering; Ioannis Gatopoulos; Maarten Stol; Vincent Tao Hu

arXiv:2106.10137·cs.CV·October 22, 2021

TL;DR

This paper introduces a self-supervised learning method for video that uses cross-stream prototypical contrasting between RGB and optical flow views to improve embedding quality. The learned embeddings capture motion without requiring optical flow at inference time.

Contribution

It proposes a new cross-stream prototypical contrastive approach that predicts consistent prototype assignments from RGB and optical flow views, enhancing video representation learning.

Findings

01 Achieved 90.5% Top-1 accuracy on UCF101 with an S3D backbone.

02 Outperformed previous methods by +3.2% on UCF101 and +15.1% on HMDB51.

03 Learned more efficient video embeddings with ingrained motion information.

Abstract

Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. They are not suitable for exploiting the rich dynamical structure of video, however, as operations are done on many augmented instances. In this paper, we propose "Video Cross-Stream Prototypical Contrasting", a novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples. Specifically, we alternate the optimization process: while optimizing one of the streams, all views are mapped to one set of stream prototype vectors. Each of the assignments is predicted with all views except the one matching the prediction, pushing representations closer to their assigned prototypes. As a result, more efficient video embeddings with ingrained motion…
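The prediction step described in the abstract (each view's prototype assignment is predicted from every other view) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the temperature values, and the use of Sinkhorn-style balancing to compute assignments are all assumptions made for illustration; the abstract does not say how assignments are obtained. The list `view_embeddings` is assumed to hold L2-normalized embeddings of augmented RGB and optical flow views of the same batch of videos, all scored against one set of prototypes.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Soft assignments with roughly balanced prototype usage.
    ASSUMPTION: Sinkhorn-style balancing; not stated in the abstract."""
    q = np.exp((scores - scores.max()) / eps).T    # shape (K, B)
    q /= q.sum()
    n_protos, n_samples = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)          # normalize over samples
        q /= n_protos
        q /= q.sum(axis=0, keepdims=True)          # normalize over prototypes
        q /= n_samples
    return (q * n_samples).T                       # shape (B, K), rows sum to 1

def cross_view_loss(view_embeddings, prototypes, temperature=0.1):
    """Mean cross-entropy between each view's assignment and the
    predictions made from all OTHER views (hypothetical helper)."""
    scores = [z @ prototypes.T for z in view_embeddings]   # each (B, K)
    codes = [sinkhorn(s) for s in scores]                  # targets
    log_probs = [log_softmax(s / temperature) for s in scores]
    total, n_terms = 0.0, 0
    for i, q in enumerate(codes):
        for j, log_p in enumerate(log_probs):
            if i == j:
                continue  # skip the view that produced the assignment
            total += -(q * log_p).sum(axis=1).mean()
            n_terms += 1
    return total / n_terms
```

In a training loop one would call `cross_view_loss` with, for example, two RGB and two flow views and the prototypes of the stream currently being optimized. Gradient computation and the alternation between streams are omitted here.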

Code & Models

Repositories

martinetoering/ViCC (PyTorch, official)

Videos

Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting · YouTube

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning · 3D Convolution · Residual Connection · Average Pooling · Global Average Pooling · Dense Connections · (2+1)D Convolution · Batch Normalization · R(2+1)D