Mix and Localize: Localizing Sound Sources in Mixtures
Xixi Hu, Ziyang Chen, Andrew Owens

TL;DR
This paper introduces a joint audio-visual method that groups a sound mixture into individual sources and localizes each one in the visual scene, outperforming other self-supervised approaches in experiments with musical instruments and human speech.
Contribution
It proposes a unified framework that uses a contrastive random walk on a graph of images and separated sounds to group a sound mixture into sources and associate each source with a visual signal, solving both tasks jointly.
Findings
Successfully localizes multiple sounds in scenes.
Outperforms other self-supervised methods in experiments.
Works with musical instruments and speech.
Abstract
We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model to both group a sound mixture into individual sources, and to associate them with a visual signal. Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods. Project site: https://hxixixh.github.io/mix-and-localize
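The random-walk objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are hand-written toy vectors rather than learned features, the walk is assumed to go audio → visual → audio, similarity is assumed to be a dot product, and the names `transition_matrix`, `cycle_loss`, and the `temperature` value are hypothetical choices made for this example.

```python
import math

def softmax(row, temperature):
    # Numerically stable softmax over one row of similarity scores.
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def transition_matrix(src, dst, temperature=0.1):
    # Row-stochastic matrix: each source node distributes probability
    # over destination nodes according to dot-product similarity (assumed).
    return [softmax([dot(s, d) for d in dst], temperature) for s in src]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def cycle_loss(audio, visual, temperature=0.1):
    # Walk audio -> visual -> audio and penalize low return probability,
    # i.e. the negative log of the diagonal of the round-trip matrix.
    a2v = transition_matrix(audio, visual, temperature)
    v2a = transition_matrix(visual, audio, temperature)
    round_trip = matmul(a2v, v2a)
    n = len(audio)
    return -sum(math.log(round_trip[i][i]) for i in range(n)) / n

# Toy embeddings (hypothetical): two separated sounds, two image nodes.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_visual = [[1.0, 0.0], [0.0, 1.0]]      # each sound matches one node
ambiguous_visual = [[0.7, 0.7], [0.7, 0.7]]    # nodes are indistinguishable

aligned_loss = cycle_loss(audio, aligned_visual)
ambiguous_loss = cycle_loss(audio, ambiguous_visual)
print(aligned_loss < ambiguous_loss)
```

When each sound has a distinct visual match, the walker returns to its starting node with high probability and the loss is near zero. When the visual nodes are indistinguishable, the walker returns with probability 1/2 in this two-node case, so the loss equals log 2.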
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies
