Learning to Separate Object Sounds by Watching Unlabeled Video
Ruohan Gao, Rogerio Feris, Kristen Grauman

TL;DR
This paper introduces a deep learning framework that learns audio-visual object models from unlabeled video and uses visual context to separate object sounds in new videos, without requiring labeled data or recordings of objects in isolation.
Contribution
It presents the first method to learn audio source separation from large-scale, unlabeled, "in the wild" videos that contain multiple audio sources per video, using a deep multi-instance multi-label learning framework.
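To make the multi-instance multi-label idea concrete, here is a minimal sketch, not the authors' network. It assumes each video is a "bag" of instances (hypothetically, one instance per audio basis vector), that some upstream model has already scored every instance against every visual object label, and that video-level scores are obtained by max pooling over instances, a common aggregation choice in multi-instance learning. The function name and the toy scores are illustrative only.

```python
import numpy as np

def miml_bag_scores(instance_scores):
    """Aggregate per-instance label scores into video-level label scores.

    instance_scores: array of shape (num_instances, num_labels). Each row is
    one instance (assumed here to be one audio basis vector) and each column
    is one visual object label.
    Returns (bag_scores, best_instance): the max score per label, and the
    index of the instance that produced that max.
    """
    instance_scores = np.asarray(instance_scores, dtype=float)
    bag_scores = instance_scores.max(axis=0)         # max pooling over instances
    best_instance = instance_scores.argmax(axis=0)   # which instance "explains" each label
    return bag_scores, best_instance

# Toy example: 3 instances, 2 labels (hypothetical values).
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.1, 0.3]])
bag, best = miml_bag_scores(scores)
```

Because only the video-level scores need to match the video-level labels, no instance is ever labeled directly; the `best_instance` indices are what would link individual bases to objects in this simplified picture.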
Findings
Achieved state-of-the-art results in visually-aided audio source separation.
Effectively disentangled audio frequency bases corresponding to visual objects.
Improved audio denoising performance in complex, multi-source videos.
Abstract
Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video. We obtain state-of-the-art…
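The abstract's last step, using per-object frequency bases to guide separation, can be illustrated with a small sketch. This is an assumption-laden stand-in rather than the paper's procedure: it uses non-negative matrix factorization with the bases held fixed, the standard multiplicative update for the Euclidean objective, and a soft ratio mask to split the mixture. The function name, the object names ("guitar", "violin"), and the toy spectrogram are all hypothetical.

```python
import numpy as np

def separate_with_bases(V, bases_per_object, n_iter=200, eps=1e-9, seed=0):
    """Split a magnitude spectrogram using fixed per-object frequency bases.

    V: non-negative array of shape (num_freqs, num_frames).
    bases_per_object: dict mapping object name -> non-negative array of
        shape (num_freqs, num_bases_for_that_object).
    Returns a dict mapping object name -> estimated magnitude spectrogram.
    """
    names = list(bases_per_object)
    # Stack all object bases side by side; W stays fixed during fitting.
    W = np.concatenate([bases_per_object[n] for n in names], axis=1)
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update of activations only (Euclidean NMF objective).
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    total = W @ H + eps
    separated, start = {}, 0
    for n in names:
        k = bases_per_object[n].shape[1]
        part = bases_per_object[n] @ H[start:start + k]
        separated[n] = V * part / total  # soft mask applied to the mixture
        start += k
    return separated

# Toy mixture: two objects occupying disjoint frequency bins.
W_a = np.array([[1.0], [1.0], [0.0], [0.0]])
W_b = np.array([[0.0], [0.0], [1.0], [1.0]])
H_a = np.array([[1.0, 0.0, 2.0]])
H_b = np.array([[0.0, 3.0, 1.0]])
V = W_a @ H_a + W_b @ H_b
sep = separate_with_bases(V, {"guitar": W_a, "violin": W_b})
```

In this sketch the visual side only decides which bases go into `bases_per_object`; how those bases are learned and selected from video is the part the paper addresses and is not reproduced here.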
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
