A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition
Shentong Mo, Pedro Morgado

TL;DR
This paper introduces OneAVM, a unified framework that jointly learns to localize, separate, and recognize sound sources using integrated audio-visual cues, improving performance across all tasks.
Contribution
The paper presents a novel unified model that simultaneously addresses localization, separation, and recognition, capturing their interdependence for enhanced audio-visual perception.
Findings
Effective across multiple benchmarks, including the MUSIC and VGG datasets.
Demonstrates strong positive transfer between localization, separation, and recognition tasks.
Outperforms separate task-specific models on all evaluated metrics.
Abstract
The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with several methods developed independently for each task. However, given the interconnected nature of source localization, separation, and recognition, independent models are likely to yield suboptimal performance as they fail to capture the interdependence between these tasks. To address this problem, we propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives. The first objective aligns audio and visual representations through a localized audio-visual correspondence loss. The second tackles visual source…
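The first objective described in the abstract, a localized audio-visual correspondence loss, might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `localized_avc_loss`, the max-pooling over spatial locations, the symmetric contrastive (InfoNCE-style) form, and the `temperature` value are all hypothetical choices that the abstract does not confirm.

```python
import numpy as np

def localized_avc_loss(audio, visual, temperature=0.07):
    """Hypothetical sketch of a localized audio-visual correspondence loss.

    audio:  (B, D) array, one audio embedding per clip.
    visual: (B, N, D) array, N spatial locations per frame.
    Returns (scalar loss, (B, N) similarity maps for the matching pairs).
    """
    # L2-normalise so that dot products are cosine similarities.
    a = audio / np.linalg.norm(audio, axis=-1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=-1, keepdims=True)

    # sim[i, j, n]: similarity of audio i with location n of frame j.
    sim = np.einsum("id,jnd->ijn", a, v)

    # Assumption: a clip-level score is the best-matching location.
    logits = sim.max(axis=-1) / temperature  # shape (B, B)

    # Per-location similarities of each matching pair act as a
    # coarse localization map.
    idx = np.arange(audio.shape[0])
    loc_maps = sim[idx, idx]  # shape (B, N)

    def cross_entropy_diag(l):
        # Row-wise log-softmax; the correct class is on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Symmetric: audio-to-visual and visual-to-audio retrieval.
    loss = 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
    return loss, loc_maps

# Toy usage with random features (shapes are illustrative only).
rng = np.random.default_rng(0)
toy_audio = rng.normal(size=(4, 8))
toy_visual = rng.normal(size=(4, 5, 8))
toy_loss, toy_maps = localized_avc_loss(toy_audio, toy_visual)
```

Under these assumptions, the map returned for each matching pair is what would be read out as the source location, while the batch-level contrastive term pushes non-matching audio and frames apart.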
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing
