Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Brian Chen; Andrew Rouditchenko; Kevin Duarte; Hilde Kuehne; Samuel Thomas; Angie Boggust; Rameswar Panda; Brian Kingsbury; Rogerio Feris; David Harwath; James Glass; Michael Picheny; Shih-Fu Chang

arXiv:2104.12671·cs.CV·October 18, 2021·1 cites

TL;DR

This paper introduces a multimodal self-supervised learning framework that creates a shared embedding space for different modalities, enabling cross-modal retrieval and semantic grouping, demonstrated on large-scale video and text datasets.

Contribution

It extends contrastive learning with a multimodal clustering step, improving semantic understanding and retrieval across modalities in a zero-shot setting.

Findings

01 Achieves state-of-the-art zero-shot retrieval results

02 Enables cross-modal retrieval across unseen datasets

03 Effectively captures semantic similarities in multimodal data
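The zero-shot retrieval finding can be illustrated with a small sketch of how cross-modal retrieval is commonly scored. Everything here is an assumption for illustration: the page does not state which metric the paper reports, so Recall@K over cosine similarity is used as a typical choice, and the toy embeddings and the `recall_at_k` helper are hypothetical rather than taken from the official repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors in the shared embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        # Rank all gallery items by similarity to the query, best first.
        ranked = sorted(range(len(gallery)),
                        key=lambda j: cosine(q, gallery[j]),
                        reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Hypothetical toy embeddings: text_emb[i] is the caption paired with video_emb[i].
text_emb = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.3]]
video_emb = [[0.9, 0.2], [0.2, 0.9], [0.6, 0.8]]

r_at_1 = recall_at_k(text_emb, video_emb, k=1)
r_at_2 = recall_at_k(text_emb, video_emb, k=2)
print(f"R@1 = {r_at_1:.2f}, R@2 = {r_at_2:.2f}")
```

In a zero-shot setting the embeddings would come from a model that was never fine-tuned on the evaluation dataset; the scoring itself stays the same.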

Abstract

Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in…
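A minimal sketch may help make the combination of instance-level contrastive learning and a clustering step concrete. It is not the authors' implementation: the InfoNCE-style loss, the k-means over averaged per-instance embeddings, the squared-distance centroid loss, the evenly spaced centroid initialization, and the weight `cluster_weight` are all assumptions chosen for illustration, and the sketch uses only two modalities and plain Python lists instead of trained network outputs.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def info_nce(xs, ys, temperature=0.1):
    """Contrastive loss: xs[i] should score highest against its pair ys[i]."""
    total = 0.0
    for i, x in enumerate(xs):
        logits = [dot(x, y) / temperature for y in ys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(xs)

def kmeans(points, k, iters=10):
    """Plain k-means with a deterministic, evenly spaced init (an assumption)."""
    centroids = [list(points[(i * len(points)) // k]) for i in range(k)]
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    # Recompute assignments so they match the final centroids.
    assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
              for p in points]
    return centroids, assign

def clustering_loss(xs, ys, k):
    """Pull both modalities of each instance toward the centroid of its cluster."""
    fused = [mean([x, y]) for x, y in zip(xs, ys)]  # one joint point per instance
    centroids, assign = kmeans(fused, k)
    loss = sum(sq_dist(x, centroids[a]) + sq_dist(y, centroids[a])
               for x, y, a in zip(xs, ys, assign))
    return loss / (2 * len(xs))

# Hypothetical toy embeddings: two semantic groups, two instances each.
video = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]]

cluster_weight = 1.0  # hypothetical weighting between the two terms
total_loss = info_nce(video, text) + cluster_weight * clustering_loss(video, text, k=2)
print(f"total loss: {total_loss:.4f}")
```

The contrastive term only distinguishes each instance from all others, while the clustering term additionally rewards instances in the same group for sitting near a shared centroid, which is the semantic grouping the abstract describes.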

Code & Models

Repositories

brian7685/Multimodal-Clustering-Network (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning