Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Arda Senocak; Tae-Hyun Oh; Junsik Kim; Ming-Hsuan Yang; In So Kweon

arXiv:1911.09649 · cs.CV · November 22, 2019 · 1 cites

TL;DR

This paper introduces an unsupervised approach for localizing sound sources in visual scenes, extends it to supervised and semi-supervised settings, and demonstrates its versatility in applications such as camera view panning in 360-degree videos.

Contribution

The paper presents a two-stream network architecture that uses an attention mechanism for sound source localization, and extends it to semi-supervised learning to correct the false conclusions that the unsupervised method reaches in some cases.

Findings

01 Unsupervised method can localize sound sources without human annotation.

02 Semi-supervised approach effectively corrects false localizations with minimal supervision.

03 The learned embeddings are versatile for cross-modal content alignment and applications like camera panning.

Abstract

Visual events are usually accompanied by sounds in our daily lives. However, can machines learn to correlate the visual scene and sound, and localize the sound source, only by observing them as humans do? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. To achieve this goal, a two-stream network structure that handles each modality with an attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. Thereby, we show that this false conclusion cannot be fixed…
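The abstract describes an attention mechanism that relates the sound stream to spatial locations in the visual stream, but it does not spell out the formula. As an illustration only, the sketch below assumes one common formulation: cosine similarity between a sound embedding and each spatial visual feature, normalized with a softmax over all locations. The function names (`cosine`, `localization_map`), the toy 2x2 feature grid, and the choice of cosine plus softmax are assumptions made for this example, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def localization_map(visual_grid, sound_emb):
    """Return an H x W map of attention weights that sum to 1.

    visual_grid: H x W nested list of D-dimensional visual features (hypothetical
                 output of the visual stream).
    sound_emb:   D-dimensional sound embedding (hypothetical output of the audio stream).
    """
    # Similarity between the sound embedding and every spatial location.
    scores = [[cosine(feat, sound_emb) for feat in row] for row in visual_grid]
    # Softmax over all locations, shifted by the maximum for numerical stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


# Toy example: a 2 x 2 grid of 3-dimensional features.
visual_grid = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.9, 0.1, 0.0]],
]
sound_emb = [1.0, 0.0, 0.0]
attn = localization_map(visual_grid, sound_emb)
```

In this toy input, the top-left location matches the sound embedding exactly and so receives the largest weight, while the two orthogonal locations receive equal, smaller weights. A real system would obtain both inputs from learned networks rather than hand-written vectors.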

Code & Models

Repositories

ardasnck/learning_to_localize_sound_source
pytorch

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Vision and Imaging