Audiovisual Masked Autoencoders

Mariana-Iuliana Georgescu; Eduardo Fonseca; Radu Tudor Ionescu; Mario Lucic; Cordelia Schmid; Anurag Arnab

arXiv:2212.05922 · cs.CV · January 5, 2024

TL;DR

This paper introduces audiovisual masked autoencoders, which use the audio and visual streams already present in video for self-supervised pretraining, and reports state-of-the-art results on several audiovisual and unimodal downstream tasks.

Contribution

It studies pretraining architectures and objectives for joint audio and video inputs within the masked autoencoding framework, and shows that a single audiovisual pretrained model transfers to both audiovisual and unimodal downstream tasks.

Findings

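01 Surpasses the state of the art on VGGSound and AudioSet audiovisual classification

02 Transfers to multiple unimodal downstream tasks from a single audiovisual pretrained model

03 Achieves state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically on that dataset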

Abstract

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
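
The masked autoencoding idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token counts, feature size, 75% mask ratio, zero-predicting stand-in decoder, and the choice to sum the two per-modality losses are all assumptions made for illustration. It shows only the two generic ingredients of the framework: hiding a random subset of tokens in each modality, and scoring the reconstruction on the hidden tokens alone.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices of one modality into (visible, masked) at random."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = list(range(num_tokens))
    rng.shuffle(perm)
    masked = sorted(perm[:num_masked])
    visible = sorted(perm[num_masked:])
    return visible, masked


def masked_mse(targets, predictions, masked_idx):
    """Mean squared error computed only over the masked token positions."""
    if not masked_idx:
        return 0.0
    total, count = 0.0, 0
    for i in masked_idx:
        for t, p in zip(targets[i], predictions[i]):
            total += (t - p) ** 2
            count += 1
    return total / count


rng = random.Random(0)

# Hypothetical inputs: 16 video tokens and 8 audio (spectrogram) tokens,
# each a 4-dimensional feature vector. Sizes are illustrative only.
video_tokens = [[rng.random() for _ in range(4)] for _ in range(16)]
audio_tokens = [[rng.random() for _ in range(4)] for _ in range(8)]

# Assumed mask ratio of 0.75 for both modalities.
video_visible, video_masked = random_mask(len(video_tokens), 0.75, rng)
audio_visible, audio_masked = random_mask(len(audio_tokens), 0.75, rng)

# Stand-in for encoder + decoder: predicts zeros for every token.
# A real model would encode the visible tokens and decode the masked ones.
video_pred = [[0.0] * 4 for _ in video_tokens]
audio_pred = [[0.0] * 4 for _ in audio_tokens]

# Assumed objective: sum of the per-modality reconstruction losses.
loss = (masked_mse(video_tokens, video_pred, video_masked)
        + masked_mse(audio_tokens, audio_pred, audio_masked))
```

In a real pretraining setup the stand-in predictions would come from a learned encoder and decoder, and how the two modalities share or separate those components is exactly the kind of architectural choice the paper studies.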

Code & Models

Models

fcxfcx/owlv2 (Hugging Face model)

Videos

Audiovisual Masked Autoencoders (YouTube)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Digital Media Forensic Detection