Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Efthymios Tzinis; Scott Wisdom; Aren Jansen; Shawn Hershey; Tal Remez; Daniel P. W. Ellis; John R. Hershey

arXiv:2011.01143 · cs.SD · June 1, 2021 · 29 cites

TL;DR

AudioScope is an audio-visual framework, trained without supervision, that isolates on-screen sounds in real in-the-wild videos across an open domain of sound classes, without requiring labels or prior visual segmentation.

Contribution

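AudioScope introduces an unsupervised training method that handles an open domain of sounds and a variable number of sources. It removes the constraints of prior audio-visual separation work, which restricted the sound classes (e.g., to speech or music), fixed the number of sources, and required strong sound separation or visual segmentation labels.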

Findings

01 Effective separation of on-screen sounds in unconstrained videos

02 Operates without supervision or prior labels

03 Works across diverse sound classes in real-world data

Abstract

Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of…
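
The mixture invariant training (MixIT) idea named in the abstract can be illustrated with a minimal sketch. In MixIT, two reference mixtures are summed, a separation model estimates several sources from the sum, and each estimated source is assigned to one of the two reference mixtures so that the remixed estimates match the references as closely as possible. The function name `mixit_loss`, the plain-list signal representation, and the use of mean squared error as the reconstruction loss are assumptions made for illustration only. This is not the authors' implementation, and it leaves out the audio-visual on-screen classification part of AudioScope entirely.

```python
from itertools import product


def mixit_loss(mix1, mix2, est_sources):
    """Return (best_loss, best_assignment) for a MixIT-style objective.

    mix1, mix2: the two reference mixtures (equal-length lists of floats).
    est_sources: M estimated sources, separated from mix1 + mix2 by some
        model (hypothetical here; any separator could produce them).
    Each estimated source is assigned to exactly one reference mixture,
    and the assignment with the lowest reconstruction error is kept.
    """
    n = len(mix1)
    best_loss, best_assign = float("inf"), None
    # Enumerate all 2^M binary assignments (0 -> mix1, 1 -> mix2).
    for assign in product((0, 1), repeat=len(est_sources)):
        remix = [[0.0] * n, [0.0] * n]
        for src, target in zip(est_sources, assign):
            for t in range(n):
                remix[target][t] += src[t]
        # Mean squared error is an illustrative stand-in for the loss.
        err = sum(
            (remix[0][t] - mix1[t]) ** 2 + (remix[1][t] - mix2[t]) ** 2
            for t in range(n)
        ) / (2 * n)
        if err < best_loss:
            best_loss, best_assign = err, assign
    return best_loss, best_assign


if __name__ == "__main__":
    mix1 = [1.0, 2.0, 3.0]
    mix2 = [0.5, -1.0, 0.0]
    estimates = [[1.0, 0.0, 3.0], [0.5, -1.0, 0.0], [0.0, 2.0, 0.0]]
    print(mixit_loss(mix1, mix2, estimates))
```

Because the assignment is searched rather than given, no isolated ground-truth sources are needed, which is what allows training on unlabeled mixtures. The exhaustive search grows as 2^M, so this sketch is only practical for a small number of output sources.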

Videos

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation