Weakly-supervised Audio-visual Sound Source Detection and Separation

Tanzila Rahman; Leonid Sigal

arXiv:2104.02606·cs.CV·April 7, 2021

TL;DR

This paper introduces a weakly-supervised audio-visual approach for localizing and separating object sounds in videos. It trains on object labels alone, with no bounding boxes, and outperforms state-of-the-art methods on the MUSIC dataset.

Contribution

It presents an end-to-end trainable framework that combines weakly-supervised object segmentation with spectrogram mask prediction for sound separation, requiring no additional supervision.

Findings

01 Outperforms state-of-the-art on MUSIC dataset

02 Effective weakly-supervised learning without bounding boxes

03 Improves sound separation and denoising quality

Abstract

Learning how to localize and separate individual object sounds in the audio channel of a video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, an approach known as the Mix-and-Separate framework. We propose audio-visual co-segmentation, in which the network learns both what individual objects look like and what they sound like, from videos labeled with only object labels. Unlike other recent visually-guided audio source separation frameworks, our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals. Specifically, we introduce weakly-supervised object segmentation in the context of sound separation. We also formulate spectrogram mask prediction using a set of learned mask bases, which are combined using coefficients conditioned on the output of object segmentation, a design that facilitates…
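
To make the mask-basis idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation. The function names (`combine_mask_bases`, `mix_and_separate_targets`), the sigmoid nonlinearity, the ratio-mask targets, and the additive mixing of magnitude spectrograms are all choices made here; random arrays stand in for the learned bases and for the coefficients that the paper conditions on the object segmentation output.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine_mask_bases(bases, coeffs):
    """Combine K mask bases of shape (K, F, T) with K coefficients.

    Returns one spectrogram mask of shape (F, T) with values in (0, 1).
    The sigmoid is an assumption of this sketch.
    """
    logits = np.tensordot(coeffs, bases, axes=1)  # weighted sum over K
    return sigmoid(logits)

def mix_and_separate_targets(mag_a, mag_b, eps=1e-8):
    """Build an artificial mixture and ratio-mask targets.

    Adding magnitudes approximates mixing the waveforms; this
    simplification is an assumption of this sketch.
    """
    mix = mag_a + mag_b
    return mix, mag_a / (mix + eps), mag_b / (mix + eps)

rng = np.random.default_rng(0)
K, F, T = 4, 8, 10  # hypothetical sizes: bases, frequency bins, frames

# Stand-ins for learned quantities (random here, for illustration only).
bases = rng.standard_normal((K, F, T))
coeffs = rng.standard_normal(K)

mag_a = rng.random((F, T))  # magnitude spectrogram of source A
mag_b = rng.random((F, T))  # magnitude spectrogram of source B
mix, target_a, target_b = mix_and_separate_targets(mag_a, mag_b)

mask = combine_mask_bases(bases, coeffs)
separated = mask * mix  # estimated spectrogram for one object
```

In training, a loss between `mask` and a target such as `target_a` would drive both the bases and the coefficient predictor; the abstract does not state the loss, so none is shown here.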
