Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac; Adrià Recasens; Rosalia Schneider; Relja Arandjelović; Jason Ramapuram; Jeffrey De Fauw; Lucas Smaira; Sander Dieleman; Andrew Zisserman

arXiv:2006.16228 · cs.CV · November 2, 2020 · 195 citations

TL;DR

This paper introduces a versatile self-supervised multi-modal network that integrates visual, audio, and language data from videos, enabling effective representations for various downstream tasks across multiple modalities.

Contribution

The work presents a multimodal versatile network that combines multiple modalities and applies to diverse downstream tasks, together with a new deflation process that lets the same network take visual input as either video or a static image.
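The deflation idea can be illustrated with a small numerical sketch. This page does not say how deflation is implemented, so the code below rests on an assumption: that a spatio-temporal (3D) filter is deflated by summing it over its temporal axis. The names `deflate_filter`, `response_3d`, and `response_2d` are hypothetical and are not taken from the official repository. Under that assumption, a filter applied to a static clip (one image repeated in time) gives the same response as the deflated filter applied to the single image.

```python
import numpy as np


def deflate_filter(w3d):
    """Collapse a (T, H, W) filter to (H, W) by summing over time.

    Assumption for illustration only: deflation = temporal summation.
    """
    return w3d.sum(axis=0)


def response_3d(w3d, clip):
    """Filter response at one fully covered position of a (T, H, W) clip."""
    return float((w3d * clip).sum())


def response_2d(w2d, image):
    """Filter response at one fully covered position of an (H, W) image."""
    return float((w2d * image).sum())


rng = np.random.default_rng(0)
w3d = rng.normal(size=(3, 5, 5))      # toy spatio-temporal filter
image = rng.normal(size=(5, 5))       # toy single-channel image patch

# A "static video": the same image repeated along the temporal axis.
static_clip = np.repeat(image[None], w3d.shape[0], axis=0)

r_video = response_3d(w3d, static_clip)
r_image = response_2d(deflate_filter(w3d), image)

# The two responses agree up to floating-point error.
print(np.isclose(r_video, r_image))
```

The equality only covers positions where the filter lies fully inside the clip. With zero padding at the temporal borders the two responses can differ, so this sketch should not be read as the complete procedure.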

Findings

01. Achieved state-of-the-art results on UCF101, HMDB51, Kinetics600, AudioSet, and ESC-50.

02. Demonstrated effective multi-modal representation learning from unlabelled video data.

03. Network can be applied to video, image, audio, and video-text tasks.

Abstract

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on…

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

deepmind/deepmind-research/tree/master/mmv (JAX, official)

Videos

Self-Supervised MultiModal Versatile Networks · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Deflation