Weakly-supervised Audio-visual Sound Source Detection and Separation
Tanzila Rahman, Leonid Sigal

TL;DR
This paper introduces a weakly-supervised audio-visual approach for localizing and separating object sounds in videos. It trains on object labels alone, without bounding boxes, and reports better results than existing methods on the MUSIC dataset.
Contribution
It presents an end-to-end trainable framework that combines weakly-supervised object segmentation with spectrogram mask prediction for sound separation, requiring no additional supervision.
Findings
Outperforms state-of-the-art methods on the MUSIC dataset
Learns effectively under weak supervision, without bounding boxes
Improves sound separation and denoising quality
Abstract
Learning to localize and separate individual object sounds in the audio channel of a video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, an approach known as the Mix-and-Separate framework. We propose an audio-visual co-segmentation approach, where the network learns both what individual objects look like and what they sound like, from videos labeled with only object labels. Unlike other recent visually-guided audio source separation frameworks, our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals. Specifically, we introduce weakly-supervised object segmentation in the context of sound separation. We also formulate spectrogram mask prediction using a set of learned mask bases, which are combined using coefficients conditioned on the output of object segmentation, a design that facilitates…
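The mask-basis idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the sigmoid activation, the ratio-mask target, the L1 loss, and the approximation that magnitude spectrograms add linearly are all assumptions made for illustration. In the paper the bases are learned and the coefficients are conditioned on the object segmentation output; here both are random placeholders.

```python
import numpy as np


def combine_mask_bases(bases, coeffs):
    """Combine K mask bases into one spectrogram mask.

    bases:  array of shape (K, F, T), one basis per row (hypothetical learned bases).
    coeffs: array of shape (K,), per-object mixing coefficients (hypothetical;
            in the paper these are conditioned on the segmentation output).
    Returns a mask of shape (F, T) with values in (0, 1).
    """
    # Weighted sum over the K bases, then a sigmoid (assumed activation).
    logits = np.tensordot(coeffs, bases, axes=1)
    return 1.0 / (1.0 + np.exp(-logits))


def mix_and_separate_example(seed=0):
    """Toy Mix-and-Separate step on random magnitude spectrograms."""
    rng = np.random.default_rng(seed)
    n_freq, n_time, n_bases = 8, 10, 4

    # Two single-source spectrograms, mixed artificially.
    # Assumption: magnitudes add linearly, which is only an approximation.
    spec_a = rng.random((n_freq, n_time))
    spec_b = rng.random((n_freq, n_time))
    mixture = spec_a + spec_b

    # Random stand-ins for the learned bases and predicted coefficients.
    bases = rng.standard_normal((n_bases, n_freq, n_time))
    coeffs_a = rng.standard_normal(n_bases)

    mask_a = combine_mask_bases(bases, coeffs_a)
    separated_a = mask_a * mixture

    # Assumed training signal: L1 distance to the ratio mask of source A.
    target_a = spec_a / (mixture + 1e-8)
    loss = float(np.mean(np.abs(mask_a - target_a)))
    return mask_a, separated_a, loss
```

Because the sources are mixed artificially, the target mask for each source is known, so the separation loss needs no annotation beyond the mixing itself.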
