Few-shot learning of new sound classes for target sound extraction

Marc Delcroix; Jorge Bennasar Vázquez; Tsubasa Ochiai; Keisuke Kinoshita; Shoko Araki

arXiv:2106.07144 · eess.AS · June 15, 2021


Open Access

TL;DR

This paper introduces a combined approach for target sound extraction that leverages both class labels and enrollment audio, enabling effective extraction of both seen and unseen sound classes, with further improvements via few-shot adaptation.

Contribution

It proposes a novel framework that combines 1-hot-vector-based and enrollment-based extraction and introduces few-shot adaptation for new sound classes.
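To make "few-shot adaptation for new sound classes" concrete, here is a toy sketch under stated assumptions; it is not the authors' procedure. It assumes that (a) the embedding for a new class is initialised as the mean of the embeddings of its few enrollment samples, (b) only that embedding is then refined by gradient descent while the extractor stays frozen, and (c) the "extractor" is reduced to element-wise scaling of hidden frames `h` toward target frames `s` under a squared-error loss. The names `mean_embedding`, `loss`, and `adapt` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Frame = Tuple[Vector, Vector]  # (mixture hidden frame h, target frame s)


def mean_embedding(embs: List[Vector]) -> Vector:
    """Hypothetical init: average the few enrollment embeddings of the new class."""
    n = len(embs)
    return [sum(col) / n for col in zip(*embs)]


def loss(e: Vector, data: List[Frame]) -> float:
    """Toy squared error between the conditioned mixture e * h and the target s."""
    return sum((ei * hi - si) ** 2 for h, s in data for ei, hi, si in zip(e, h, s))


def adapt(e: Vector, data: List[Frame], lr: float = 0.1, steps: int = 100) -> Vector:
    """Plain gradient descent on the embedding only; no other weights change."""
    e = list(e)
    for _ in range(steps):
        grad = [0.0] * len(e)
        for h, s in data:
            for d in range(len(e)):
                # d/de_d of (e_d * h_d - s_d)^2
                grad[d] += 2.0 * (e[d] * h[d] - s[d]) * h[d]
        e = [ei - lr * gi for ei, gi in zip(e, grad)]
    return e


# Two enrollment embeddings for the new class and two few-shot training frames.
init = mean_embedding([[0.8, 1.0], [1.2, 1.0]])
shots = [([1.0, 1.0], [2.0, 0.5]), ([0.5, 2.0], [1.0, 1.0])]
adapted = adapt(init, shots)
print(init, loss(init, shots), adapted, loss(adapted, shots))
```

In this toy, the mean-enrollment initialisation gives a loss of 2.5, and adaptation drives the embedding toward `[2.0, 0.5]`, where the loss is essentially zero. The learning rate is chosen to keep this tiny problem stable and would need tuning in any real setting.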

Findings

01. Effective extraction of unseen sound classes is demonstrated.

02. The combined framework outperforms traditional methods on synthesized mixtures.

03. Few-shot adaptation improves performance on new classes.

Abstract

Target sound extraction consists of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds. It can be realized using a neural network that extracts the target sound conditioned on a 1-hot vector representing the desired AE class. With this approach, the embedding vectors associated with the AE classes are directly optimized for extracting the sound classes seen during training. However, it is not easy to extend this framework to new AE classes, i.e., classes unseen during training. Recently, speech, music, and AE sound extraction based on enrollment audio of the desired sound has offered the potential of extracting any target sound in a mixture given only a short audio signal of a similar sound. In this work, we propose combining 1-hot-based and enrollment-based target sound extraction, allowing optimal performance for seen AE classes and simple extension to new classes.…
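The two conditioning routes in the abstract can be illustrated with a minimal sketch, assuming (these are assumptions, not details given here) that a 1-hot vector selects a row of a learned class-embedding table, that a hypothetical enrollment encoder maps enrollment frames into the same embedding space by a linear projection, ReLU, and time averaging, and that the extractor applies the resulting embedding by element-wise multiplication of hidden mixture features. The names `onehot_embedding`, `enrollment_embedding`, and `condition` are illustrative only.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def onehot_embedding(onehot: Vector, table: Matrix) -> Vector:
    """e = table^T @ onehot: the 1-hot vector picks one row of the K x D class table."""
    dim = len(table[0])
    return [sum(onehot[k] * table[k][d] for k in range(len(table))) for d in range(dim)]


def enrollment_embedding(frames: Matrix, proj: Matrix) -> Vector:
    """Hypothetical encoder: project each F-dim frame with proj (F x D), ReLU, average over time."""
    dim = len(proj[0])
    acc = [0.0] * dim
    for frame in frames:
        for d in range(dim):
            acc[d] += max(0.0, sum(frame[f] * proj[f][d] for f in range(len(frame))))
    return [a / len(frames) for a in acc]


def condition(features: Matrix, emb: Vector) -> Matrix:
    """Assumed multiplicative conditioning: scale every hidden frame element-wise by emb."""
    return [[h * e for h, e in zip(frame, emb)] for frame in features]


table = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]   # K = 3 seen classes, D = 2
e_class = onehot_embedding([0.0, 1.0, 0.0], table)
e_enroll = enrollment_embedding([[0.4, 2.2], [0.6, 1.8]], [[1.0, 0.0], [0.0, 1.0]])
hidden = condition([[1.0, 1.0], [2.0, 0.5]], e_class)
print(e_class, e_enroll, hidden)
```

Here the 1-hot route yields `[0.5, 2.0]` for class 1, and the toy enrollment clip averages to the same point, which is the property a shared embedding space would need: either route can drive the same extractor, and an unseen class only needs enrollment audio.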


Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Autoencoders