AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

Efthymios Tzinis; Scott Wisdom; Tal Remez; John R. Hershey

arXiv:2207.10141 · cs.SD · July 22, 2022 · 1 citation

Open Access

TL;DR

AudioScopeV2 introduces cross-modal and self-attention architectures for on-screen sound separation that model audio-visual dependencies at a finer temporal resolution. Combined with audio-only pre-training and a new, more diverse dataset, they improve separation quality on real-world videos, and separable variants scale to longer videos.

Contribution

The paper contributes cross-modal and self-attention architectures, a calibration procedure for the trade-off between preserving on-screen sounds and suppressing off-screen sounds, and a new dataset for on-screen sound separation in unconstrained videos.

Findings

01 Enhanced separation accuracy over previous methods

02 Effective scaling to longer videos with separable architectures

03 Pre-training on audio alone boosts separation performance

Abstract

We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify several limitations of previous work on audio-visual on-screen sound separation, including the coarse resolution of spatio-temporal attention, poor convergence of the audio separation model, limited variety in training and evaluation data, and failure to account for the trade-off between preservation of on-screen sounds and suppression of off-screen sounds. We provide solutions to all of these issues. Our proposed cross-modal and self-attention network architectures capture audio-visual dependencies at a finer resolution over time, and we also propose efficient separable variants that are capable of scaling to longer videos without sacrificing much…
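To make the idea of a "separable" attention variant concrete, the sketch below contrasts joint cross-modal attention over all space-time video positions with a factorized version that attends over space first and then over time. It is only an illustration under stated assumptions, not the architecture from the paper: the function names, the tensor shapes, the alignment of audio frames with video frames, and the particular space-then-time factorization are all hypothetical choices made here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; keys/values lie along the second-to-last axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def joint_cross_attention(audio, video):
    # audio: (T, D) queries; video: (T, G, D) with G spatial locations per frame.
    # Every audio frame attends to all T*G video positions at once.
    T, G, D = video.shape
    flat = video.reshape(T * G, D)
    return attention(audio, flat, flat)  # shape (T, D)

def separable_cross_attention(audio, video):
    # Hypothetical factorization (an assumption, not the paper's definition):
    # step 1: each audio frame attends over the G locations of its own video frame.
    pooled = attention(audio[:, None, :], video, video)[:, 0, :]  # shape (T, D)
    # step 2: each audio frame attends over the T spatially pooled frames.
    return attention(audio, pooled, pooled)  # shape (T, D)

def score_counts(T, G):
    # Number of query-key scores computed by each variant.
    joint = T * (T * G)
    separable = T * G + T * T
    return joint, separable
```

With T frames and G spatial locations, the joint variant computes T·T·G attention scores while this factorized variant computes T·G + T·T, which is one way to see why separable designs can scale more gracefully as videos get longer.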

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Hearing Loss and Rehabilitation