The Sound of Bounding-Boxes

Takashi Oya; Shohei Iwase; Shigeo Morishima

arXiv:2203.15991 · cs.CV · March 31, 2022

TL;DR

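This paper introduces an unsupervised approach to audio-visual sound source separation that learns to detect objects and separate their sounds simultaneously. Because it does not rely on a pre-trained object detector, it applies to arbitrary object categories without additional annotation, while reaching separation accuracy comparable to supervised methods.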

Contribution

The proposed method jointly learns object detection and sound separation in an unsupervised manner. This removes both the dependency on pre-trained object detectors and the need to predetermine every category of sound-producing object.

Findings

01

Performs comparably to supervised methods in separation accuracy

02

Does not require pre-trained object detectors or category annotations

03

Applicable to arbitrary object categories without additional annotations

Abstract

In the task of audio-visual sound source separation, which leverages visual information for sound source separation, identifying objects in an image is a crucial step prior to separating the sound source. However, existing methods that assign sound on detected bounding boxes suffer from a problem that their approach heavily relies on pre-trained object detectors. Specifically, when using these existing methods, it is required to predetermine all the possible categories of objects that can produce sound and use an object detector applicable to all such categories. To tackle this problem, we propose a fully unsupervised method that learns to detect objects in an image and separate sound source simultaneously. As our method does not rely on any pre-trained detector, our method is applicable to arbitrary categories without any additional annotation. Furthermore, although being fully…
