Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

Shentong Mo; Pedro Morgado

arXiv:2312.01017·cs.CV·December 5, 2023·1 cites

Open Access 1 Repo

TL;DR

This paper introduces a novel masked reconstruction training framework and an attention-based fusion module for early fusion audio-visual models, enabling efficient learning of fine-grained interactions and improving performance across multiple multimodal tasks.

Contribution

It proposes a new training approach using masked reconstruction and an interaction-aware fusion module for early fusion audio-visual models, addressing computational challenges and enhancing multimodal understanding.
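As a rough illustration of the masked reconstruction idea, the sketch below picks which audio and visual tokens are hidden from the encoder and left for a decoder to reconstruct. It is a minimal sketch, not the authors' implementation: the helper `random_mask`, the 0.75 mask ratio, the token counts, and the choice to mask each modality independently are all assumptions made for illustration.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into (visible, masked) lists for masked reconstruction.

    Hypothetical helper: the source does not specify how tokens are masked.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_masked = int(num_tokens * mask_ratio)
    order = list(range(num_tokens))
    rng.shuffle(order)  # uniform random permutation of token positions
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


rng = random.Random(0)  # seeded for reproducibility

# Assumed token counts: 196 image patches and 64 spectrogram patches.
video_visible, video_masked = random_mask(196, 0.75, rng)
audio_visible, audio_masked = random_mask(64, 0.75, rng)

# In an early fusion setup, the visible tokens from both modalities would be
# processed jointly by the encoder; the masked ones are the reconstruction targets.
print(len(video_visible), len(video_masked))
print(len(audio_visible), len(audio_masked))
```

Only the visible tokens would be passed to the encoder, which is one reason masked training can be cheaper than processing every token.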

Findings

1. Outperforms existing methods in audio-event classification
2. Improves visual sound localization accuracy
3. Enhances sound separation and audio-visual segmentation

Abstract

Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models. However, training early fusion architectures poses significant challenges, as the increased model expressivity requires robust learning frameworks to harness their enhanced capabilities. In this paper, we address this challenge by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion. Additionally, we propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained…

Code & Models

Repositories

stonemo/deepavfusion
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing