Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon

TL;DR
This paper introduces an approach for localizing sound sources in visual scenes that can be trained in unsupervised, supervised, and semi-supervised settings, and demonstrates its versatility in applications such as camera view panning in 360-degree videos.
Contribution
It presents a new two-stream network architecture, with an attention mechanism handling each modality, for sound source localization, and extends it to semi-supervised learning to correct the false localizations produced by the unsupervised method.
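As a rough illustration of how an attention mechanism can turn a sound embedding into a localization map, the sketch below scores each spatial location of a visual feature grid by its cosine similarity to the sound embedding and normalizes the scores with a softmax. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature extractors for both streams are omitted, the toy vectors are made up, and the function names (`l2_normalize`, `attention_map`) are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def attention_map(visual_feats, sound_emb):
    """Return one softmax attention weight per spatial location.

    visual_feats: list of per-location feature vectors (a flattened H*W grid).
    sound_emb:    a single embedding vector for the audio clip, same dimension.
    Both inputs are assumed to come from the two network streams.
    """
    s = l2_normalize(sound_emb)
    # Cosine similarity between the sound embedding and each location.
    scores = [sum(a * b for a, b in zip(l2_normalize(v), s)) for v in visual_feats]
    # Numerically stable softmax over all locations.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: a 2x2 grid flattened to four 2-D feature vectors (made-up values).
visual = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
sound = [0.0, 2.0]

weights = attention_map(visual, sound)
peak = max(range(len(weights)), key=weights.__getitem__)
print("attention weights:", [round(w, 3) for w in weights])
print("most likely sound source location (flat index):", peak)
```

The location with the largest weight is read as the most likely sound source. In a trained network the weights would also be used to pool the visual features before comparing them with the sound embedding.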
Findings
The unsupervised method can localize sound sources without human annotation.
The semi-supervised approach effectively corrects false localizations with minimal supervision.
The learned embeddings are versatile for cross-modal content alignment and applications like camera panning.
Abstract
Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn to correlate the visual scene and sound, as well as localize the sound source only by observing them like humans? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. In order to achieve this goal, a two-stream network structure which handles each modality with attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. Thereby, we show that this false conclusion cannot be fixed…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Vision and Imaging
