A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

Shentong Mo; Pedro Morgado

arXiv:2305.19458·cs.SD·June 1, 2023·6 cites

TL;DR

This paper introduces OneAVM, a unified framework that jointly learns to localize, separate, and recognize sound sources using integrated audio-visual cues, improving performance across all tasks.

Contribution

The paper presents a novel unified model that simultaneously addresses localization, separation, and recognition, capturing their interdependence for enhanced audio-visual perception.

Findings

01. Effective across multiple datasets, including the MUSIC and VGG datasets.

02. Demonstrates strong positive transfer between the localization, separation, and recognition tasks.

03. Outperforms separate task-specific models on all evaluated metrics.

Abstract

The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with several methods developed independently for each task. However, given the interconnected nature of source localization, separation, and recognition, independent models are likely to yield suboptimal performance as they fail to capture the interdependence between these tasks. To address this problem, we propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives. The first objective aligns audio and visual representations through a localized audio-visual correspondence loss. The second tackles visual source…
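
To make the first objective more concrete, here is a minimal sketch of what a localized audio-visual correspondence loss could look like. It is not taken from the OneAVM repository: the function names, the max-pooling over spatial locations, the InfoNCE-style normalization, and the temperature value are all assumptions chosen for illustration, and the paper's actual formulation may differ.

```python
import math

def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def localized_similarity(audio_vec, visual_map):
    """Score one audio clip against one frame.

    visual_map is a list of per-location feature vectors. Taking the
    maximum over locations (an assumption) lets a single sounding
    region drive the score.
    """
    return max(dot(audio_vec, loc) for loc in visual_map)

def correspondence_loss(audio_batch, visual_batch, temperature=0.1):
    """InfoNCE-style loss (assumed form): each audio clip should score
    highest against its own frame among all frames in the batch."""
    n = len(audio_batch)
    total = 0.0
    for i in range(n):
        logits = [
            localized_similarity(audio_batch[i], visual_batch[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / n

# Toy batch: two clips, each frame has two spatial locations.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [
    [[1.0, 0.0], [0.0, 0.0]],  # frame 0: source visible at location 0
    [[0.0, 0.0], [0.0, 1.0]],  # frame 1: source visible at location 1
]
matched = correspondence_loss(audio, visual)
mismatched = correspondence_loss(audio, visual[::-1])
```

In this toy batch the correctly paired frames yield a much lower loss than the swapped ones, which is the behavior an alignment objective of this kind is meant to reward.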

Code & Models

Repositories

stonemo/oneavm (official)

Videos

A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing
