Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey

TL;DR
AudioScope is an unsupervised audio-visual framework that isolates on-screen sounds from wild videos across diverse sound classes without requiring labels or prior segmentation.
Contribution
AudioScope introduces a fully unsupervised method that handles an open domain of sounds and a variable number of sources, removing the restricted sound classes, fixed source counts, and label requirements of prior approaches.
Findings
Effective separation of on-screen sounds in unconstrained videos
Operates without supervision or prior labels
Works across diverse sound classes in real-world data
Abstract
Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of…
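The mixture invariant training (MixIT) objective named in the abstract can be illustrated with a small sketch. In MixIT, two reference mixtures are summed, a model separates the sum into several estimated sources, and the loss is the best score over all ways of assigning each estimated source to one of the two reference mixtures. The sketch below is a minimal NumPy illustration under stated assumptions, not AudioScope's implementation: the function names `neg_snr_db` and `mixit_loss` are hypothetical, plain negative SNR stands in for whatever signal-level loss the paper uses, and the separation model itself is omitted (the estimated sources are passed in directly).

```python
import itertools

import numpy as np


def neg_snr_db(reference, estimate, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower is better).

    Assumption: plain SNR is used here purely for illustration.
    """
    signal_power = np.sum(reference ** 2)
    noise_power = np.sum((reference - estimate) ** 2)
    return -10.0 * np.log10((signal_power + eps) / (noise_power + eps))


def mixit_loss(mixtures, est_sources):
    """Brute-force MixIT loss (hypothetical helper for illustration).

    mixtures:    array of shape (num_mixtures, T), the reference mixtures.
    est_sources: array of shape (num_sources, T), the separated estimates.

    Each estimated source is assigned to exactly one reference mixture;
    the loss is the minimum total negative SNR over all such assignments.
    Returns (best_loss, best_assignment), where best_assignment[m] is the
    index of the mixture that source m was assigned to.
    """
    num_mixtures = mixtures.shape[0]
    num_sources = est_sources.shape[0]
    best_loss, best_assignment = np.inf, None
    # Enumerate every mapping from sources to mixtures.
    for assignment in itertools.product(range(num_mixtures), repeat=num_sources):
        remixed = np.zeros_like(mixtures)
        for source_idx, mix_idx in enumerate(assignment):
            remixed[mix_idx] += est_sources[source_idx]
        loss = sum(
            neg_snr_db(mixtures[i], remixed[i]) for i in range(num_mixtures)
        )
        if loss < best_loss:
            best_loss, best_assignment = loss, assignment
    return best_loss, best_assignment
```

Because the loss only compares remixed estimates against the reference mixtures, it never needs isolated ground-truth sources, which is what allows training without separation labels. The exhaustive search grows as `num_mixtures ** num_sources`, so this form is only practical for a small number of output sources.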
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
