The Sound of Bounding-Boxes
Takashi Oya, Shohei Iwase, Shigeo Morishima

TL;DR
This paper introduces an unsupervised approach to audio-visual sound source separation that jointly learns to detect objects and separate their sounds without a pre-trained detector. It therefore applies to arbitrary object categories while reaching separation accuracy comparable to supervised methods.
Contribution
The proposed method jointly learns object detection and sound separation in an unsupervised manner, removing both the dependency on pre-trained object detectors and the need to fix the set of sound-producing categories in advance.
Findings
Performs comparably to supervised methods in separation accuracy
Does not require pre-trained object detectors or category annotations
Applicable to arbitrary object categories without additional annotations
Abstract
In audio-visual sound source separation, which leverages visual information to separate sound sources, identifying the objects in an image is a crucial step before separation. However, existing methods that assign sounds to detected bounding boxes rely heavily on pre-trained object detectors. Specifically, they require predetermining every category of object that can produce sound and using an object detector that covers all such categories. To tackle this problem, we propose a fully unsupervised method that learns to detect objects in an image and separate sound sources simultaneously. Because our method does not rely on any pre-trained detector, it is applicable to arbitrary categories without any additional annotation. Furthermore, although being fully…
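To make the idea of assigning sound to bounding boxes concrete, here is a minimal sketch of mask-based separation over a spectrogram. It is an illustration under stated assumptions, not the authors' implementation: the function name `separate_by_boxes`, the per-box score maps `box_logits`, and the softmax normalization across boxes are all hypothetical choices made for this example.

```python
import numpy as np

def separate_by_boxes(mixture_mag, box_logits):
    """Split a mixture magnitude spectrogram into one spectrogram per box.

    mixture_mag: (F, T) non-negative magnitudes of the mixed audio.
    box_logits:  (K, F, T) hypothetical per-box scores, one map per box.
                 In a real system these would come from a learned model.
    """
    # Softmax over the box axis so the K masks sum to 1 in every bin.
    shifted = box_logits - box_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    masks = weights / weights.sum(axis=0, keepdims=True)
    # Each box receives its share of the mixture energy.
    return masks * mixture_mag[None, :, :]

# Toy inputs: random data stands in for real audio and model outputs.
rng = np.random.default_rng(0)
mixture = rng.random((4, 6))          # 4 frequency bins, 6 time frames
logits = rng.normal(size=(3, 4, 6))   # 3 candidate boxes
sources = separate_by_boxes(mixture, logits)
```

Because the masks sum to one in every time-frequency bin, the per-box spectrograms add back up to the mixture. Under this assumption, separation amounts to deciding how much of each bin belongs to each box.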
