Dual Normalization Multitasking for Audio-Visual Sounding Object Localization

Tokuhiro Nishikawa; Daiki Shimada; Jerry Jun Yokono

arXiv:2106.00180·cs.CV·June 2, 2021

Open Access

TL;DR

This paper introduces a new task called Audio-Visual Sounding Object Localization (AVSOL), proposes a dataset and metrics for evaluation, and presents a novel dual normalization multitasking approach that improves localization accuracy.

Contribution

The paper defines the AVSOL problem, creates the AVSOL-E dataset with new evaluation metrics, and proposes the DNM architecture for better audio-visual localization performance.

Findings

01

Proposed AVSOL-E dataset for quantitative evaluation.

02

Introduced dual normalization multitasking (DNM) architecture.

03

DNM significantly outperforms baseline methods.

Abstract

Although several research works have been reported on audio-visual sound source localization in unconstrained videos, no datasets or metrics have been proposed in the literature to quantitatively evaluate its performance. Defining the ground truth for sound source localization is difficult because the location where the sound is produced is not limited to the extent of the source object: the vibrations propagate and spread through surrounding objects. Therefore, we propose a new concept, Sounding Object, to reduce the ambiguity of the visual location of sound, making it possible to annotate the location of a wide range of sound sources. With newly proposed metrics for quantitative evaluation, we formulate the problem of Audio-Visual Sounding Object Localization (AVSOL). We also created an evaluation dataset (the AVSOL-E dataset) by manually annotating the test set of well-known…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation