Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

TL;DR
This paper introduces a versatile self-supervised multi-modal network that integrates visual, audio, and language data from videos, enabling effective representations for various downstream tasks across multiple modalities.
Contribution
The work presents a multimodal versatile network that can ingest multiple modalities and be applied to diverse downstream tasks, together with a new deflation process that lets the same network take visual input as either video or a static image.
Findings
Achieved state-of-the-art results on UCF101, HMDB51, Kinetics600, AudioSet, and ESC-50 when compared to previous self-supervised work.
Demonstrated effective multi-modal representation learning from unlabelled video data.
The network can be applied to video, image, audio, and video-text tasks.
Abstract
Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on…
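The abstract does not spell out how deflation works, so the following is only a toy sketch of one plausible reading: a temporal filter is collapsed into a time-free weight by summing it over time, so that the deflated network applied to one image matches the video network applied to that image repeated as a static video. The summing rule, the scalar-per-frame simplification, and all function names (temporal_conv, deflate_kernel, apply_to_image) are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: the filter-summing rule and all names here are
# assumptions, not details taken from the abstract above.

def temporal_conv(frames, kernel):
    """Valid 1D cross-correlation over time; each frame is reduced to one scalar."""
    k = len(kernel)
    return [
        sum(kernel[j] * frames[t + j] for j in range(k))
        for t in range(len(frames) - k + 1)
    ]

def deflate_kernel(kernel):
    """Collapse a temporal kernel into a single time-free weight by summing over time."""
    return sum(kernel)

def apply_to_image(pixel_value, kernel):
    """Apply the deflated weight to a single image value (no time axis)."""
    return deflate_kernel(kernel) * pixel_value

kernel = [0.25, 0.5, 0.25]         # hypothetical temporal filter
image_value = 2.0                  # hypothetical single-pixel "image"
static_video = [image_value] * 5   # the same image repeated over 5 frames

video_out = temporal_conv(static_video, kernel)
image_out = apply_to_image(image_value, kernel)

# On a static video, every output time step should match the deflated result.
print(all(abs(v - image_out) < 1e-9 for v in video_out))
```

Because the input is constant in time, each output step of the temporal filter reduces to the sum of the kernel times the image value, which is exactly what the deflated weight computes. A real network would apply the same idea per channel and per spatial position.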
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Deflation
