Dual Normalization Multitasking for Audio-Visual Sounding Object Localization
Tokuhiro Nishikawa, Daiki Shimada, Jerry Jun Yokono

TL;DR
This paper introduces a new task called Audio-Visual Sounding Object Localization (AVSOL), proposes a dataset and metrics for evaluation, and presents a novel dual normalization multitasking approach that improves localization accuracy.
Contribution
The paper defines the AVSOL problem, creates the AVSOL-E evaluation dataset together with new evaluation metrics, and proposes the dual normalization multitasking (DNM) architecture to improve audio-visual localization accuracy.
Findings
Proposed AVSOL-E dataset for quantitative evaluation.
Introduced dual normalization multitasking (DNM) architecture.
DNM significantly outperforms baseline methods.
Abstract
Although several works have been reported on audio-visual sound source localization in unconstrained videos, no datasets or metrics have been proposed in the literature to quantitatively evaluate its performance. Defining the ground truth for sound source localization is difficult because the location where the sound is produced is not limited to the extent of the source object; the vibrations propagate and spread through surrounding objects. Therefore, we propose a new concept, Sounding Object, to reduce the ambiguity of the visual location of sound, making it possible to annotate the locations of a wide range of sound sources. With newly proposed metrics for quantitative evaluation, we formulate the problem of Audio-Visual Sounding Object Localization (AVSOL). We also created the evaluation dataset (AVSOL-E dataset) by manually annotating the test set of well-known…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
