SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning

Marc Delcroix; Jorge Bennasar Vázquez; Tsubasa Ochiai; Keisuke Kinoshita; Yasunori Ohishi; Shoko Araki

arXiv:2204.03895 · eess.AS · November 3, 2022

TL;DR

SoundBeam is a novel target sound extraction framework that combines class label conditioning and enrollment audio clues to improve performance and enable continuous learning of new sound classes.

Contribution

We introduce SoundBeam, a TSE framework that integrates class labels and enrollment audio, enhancing extraction flexibility and performance for known and new sound classes.

Findings

01 SoundBeam outperforms existing methods on synthesized and real mixtures.

02 It effectively handles new sound classes through enrollment-based learning.

03 The framework demonstrates significant improvements in target sound extraction accuracy.

Abstract

In many situations, we would like to hear desired sound events (SEs) while being able to ignore interference. Target sound extraction (TSE) tackles this problem by estimating the audio signal of the sounds of target SE classes in a mixture of sounds while suppressing all other sounds. We can achieve this with a neural network that extracts the target SEs by conditioning it on clues representing the target SE classes. Two types of clues have been proposed, i.e., target SE class labels and enrollment audio samples (or audio queries), which are pre-recorded audio samples of sounds from the target SE classes. Systems based on SE class labels can directly optimize embedding vectors representing the SE classes, resulting in high extraction performance. However, extending these systems to extract new SE classes not encountered during training is not easy. Enrollment-based approaches extract…
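
The two clue types described in the abstract can be illustrated with a minimal sketch. This is not the authors' network: the function names, the embedding size, the summing of class embeddings, the averaging of enrollment frames over time, and the element-wise multiplicative conditioning are all assumptions chosen for illustration. It shows only the general idea that a class label selects an embedding vector, an enrollment sample is reduced to a vector of the same size, and either vector can then condition the processing of the mixture.

```python
import random


def make_class_embeddings(class_names, dim=4, seed=0):
    """Hypothetical embedding table: one vector per known SE class.

    In a trained system these vectors would be optimized jointly with the
    extraction network; here they are just randomly initialized.
    """
    rng = random.Random(seed)
    return {
        name: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for name in class_names
    }


def label_clue(class_embeddings, target_classes):
    """Class-label clue: sum the embeddings of the requested classes.

    Summation is an assumption of this sketch, used so that several target
    classes can be requested at once.
    """
    dim = len(next(iter(class_embeddings.values())))
    clue = [0.0] * dim
    for name in target_classes:
        for i, value in enumerate(class_embeddings[name]):
            clue[i] += value
    return clue


def enrollment_clue(enrollment_frames):
    """Enrollment clue: average per-frame feature vectors over time.

    A real system would first pass the enrollment audio through a learned
    encoder; plain averaging stands in for that step here.
    """
    n_frames = len(enrollment_frames)
    dim = len(enrollment_frames[0])
    return [
        sum(frame[i] for frame in enrollment_frames) / n_frames
        for i in range(dim)
    ]


def condition(mixture_frames, clue):
    """Condition mixture features on a clue by element-wise multiplication."""
    return [[x * c for x, c in zip(frame, clue)] for frame in mixture_frames]


if __name__ == "__main__":
    embeddings = make_class_embeddings(["dog_bark", "siren"], dim=2)
    mixture = [[1.0, 2.0], [3.0, 4.0]]  # toy mixture features (2 frames)

    # Known class: use the label clue.
    print(condition(mixture, label_clue(embeddings, ["siren"])))

    # New class without a label: use an enrollment sample instead.
    enrollment = [[0.5, 1.0], [1.5, 3.0]]  # toy enrollment features
    print(condition(mixture, enrollment_clue(enrollment)))
```

Because both clue functions return a vector of the same size, the downstream `condition` step does not need to know which kind of clue it received. That shared interface is what lets a label-based and an enrollment-based path coexist in one sketch.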

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis