Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning

Srijan Das; Michael S. Ryoo

arXiv:2112.03906 · cs.CV · July 31, 2023

TL;DR

This paper introduces Cross-Modal Manifold Cutmix, a novel video augmentation method that combines videos across different modalities in feature space, improving self-supervised video representation learning with less data.

Contribution

The paper proposes a new cross-modal video mixing strategy, STC-mix, that enhances self-supervised learning by integrating videos across modalities in feature space.

Findings

01

STC-mix improves downstream task performance on UCF101 and HMDB51.

02

STC-mix achieves comparable results to existing methods with less training data.

03

STC-mix is also effective on datasets where domain knowledge is limited, such as NTU.

Abstract

Contrastive representation learning of videos relies heavily on the availability of millions of unlabelled videos. This is practical for videos available on the web, but acquiring videos at such a large scale for real-world applications is very expensive and laborious. Therefore, in this paper we focus on designing video augmentation for self-supervised learning. We first analyze the best strategy to mix videos to create a new augmented video sample. Then the question remains: can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy STC-mix, i.e. preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of learned video…
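The tesseract insertion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes feature maps stored as nested lists indexed [channel][time][height][width], a CutMix-style mixing coefficient lam in [0, 1], and a box whose side on each axis is scaled by the cube root of (1 - lam). The function names sample_tesseract, manifold_cutmix, and constant_features are hypothetical, and the layer at which the features are mixed and the training loss are left out.

```python
import copy
import random


def sample_tesseract(T, H, W, lam, rng):
    """Sample a spatio-temporal box inside a T x H x W feature volume.

    Assumption (not from the paper): every axis is scaled by the same
    ratio, so the box covers roughly (1 - lam) of the volume.
    """
    ratio = (1.0 - lam) ** (1.0 / 3.0)
    sizes = [min(n, int(round(n * ratio))) for n in (T, H, W)]
    # Random start per axis, chosen so the box stays inside the volume.
    starts = [rng.randint(0, n - s) for n, s in zip((T, H, W), sizes)]
    return tuple((a, a + s) for a, s in zip(starts, sizes))


def manifold_cutmix(feat_a, feat_b, box):
    """Paste the boxed region of feat_b into a copy of feat_a.

    feat_a and feat_b stand for same-shaped feature maps from two
    modalities of a video (hypothetical inputs for this sketch).
    """
    (t0, t1), (h0, h1), (w0, w1) = box
    mixed = copy.deepcopy(feat_a)  # leave the input features untouched
    for c in range(len(mixed)):
        for t in range(t0, t1):
            for h in range(h0, h1):
                mixed[c][t][h][w0:w1] = feat_b[c][t][h][w0:w1]
    return mixed


def constant_features(value, C, T, H, W):
    """Build a toy C x T x H x W feature map filled with one value."""
    return [[[[value] * W for _ in range(H)] for _ in range(T)]
            for _ in range(C)]


rng = random.Random(0)
C, T, H, W = 2, 4, 4, 4
feat_a = constant_features(0.0, C, T, H, W)  # toy features, modality A
feat_b = constant_features(1.0, C, T, H, W)  # toy features, modality B
box = sample_tesseract(T, H, W, lam=0.5, rng=rng)
mixed = manifold_cutmix(feat_a, feat_b, box)
```

With the toy inputs above, every cell of mixed that falls inside the sampled box carries the modality-B value and every other cell keeps the modality-A value, which makes the mixed region easy to inspect.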

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: CutMix