Visual Sound Localization in the Wild by Cross-Modal Interference Erasing
Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou

TL;DR
This paper introduces the Interference Eraser framework for audio-visual sound source localization in real-world scenarios, effectively removing off-screen sounds and background noise to improve localization accuracy.
Contribution
It proposes a novel framework with modules for discriminative audio representation and cross-modal interference removal, addressing real-world challenges in sound localization.
Findings
Achieves superior localization results in wild scenarios
Effectively removes off-screen sounds and background noise
Outperforms previous methods in real-world tests
Abstract
The task of audio-visual sound source localization has been well studied under constrained scenes, where the audio recordings are clean. However, in real-world scenarios, audios are usually contaminated by off-screen sound and background noise. They will interfere with the procedure of identifying desired sources and building visual-sound connections, making previous studies non-applicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals. We thus extend the audio representation with our Audio-Instance-Identifier module, which…
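The abstract's remark about the additive nature of audio signals can be illustrated with a small sketch. This is not the paper's method or its Audio-Instance-Identifier module; it only shows, under assumed toy signals, that a mixture's spectrum is the sum of its sources' spectra, so any single descriptor computed from the mixture also carries the off-screen component. The sample rate, tone frequencies, and helper names (`tone`, `band_magnitude`) are all hypothetical choices made for illustration.

```python
import math

SAMPLE_RATE = 8000  # Hz, assumed for illustration
NUM_SAMPLES = 800   # 0.1 s of audio, so DFT bins fall on multiples of 10 Hz


def tone(freq_hz, amplitude=1.0):
    """Generate a pure sine tone (a stand-in for one sound source)."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n in range(NUM_SAMPLES)
    ]


def band_magnitude(signal, freq_hz):
    """Normalized magnitude of the signal's projection onto one frequency."""
    re = sum(
        s * math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n, s in enumerate(signal)
    )
    im = sum(
        s * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n, s in enumerate(signal)
    )
    return math.hypot(re, im) / len(signal)


# Hypothetical sources: one visible in the frame, one off-screen.
on_screen = tone(440.0)
off_screen = tone(1000.0, amplitude=0.5)

# Audio signals add sample by sample when sources sound together.
mixture = [a + b for a, b in zip(on_screen, off_screen)]

# The mixture keeps the on-screen component and also contains the
# off-screen one, which the clean on-screen recording does not.
mix_440 = band_magnitude(mixture, 440.0)
mix_1000 = band_magnitude(mixture, 1000.0)
clean_1000 = band_magnitude(on_screen, 1000.0)
```

Here `mix_1000` is non-zero while `clean_1000` is essentially zero: a representation computed once over the whole mixture cannot tell which component belongs to a visible object, which is the limitation the abstract attributes to learning only a single audio representation.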
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
