Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection

Aswin Sivaraman; Minje Kim

arXiv:2105.03542 · eess.AS · May 11, 2021

TL;DR

This paper introduces a zero-shot personalized speech enhancement method using a speaker-informed ensemble model, which efficiently adapts to unseen speakers by selecting specialized modules based on estimated speaker characteristics.

Contribution

It proposes a novel ensemble approach with a gating mechanism and speaker grouping via Siamese networks and clustering, enabling zero-shot adaptation without test-time data collection.
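The speaker-grouping step can be illustrated with a small sketch. It assumes each training speaker is already summarized by one embedding vector (in the paper these come from a Siamese network; here they are made-up 2-D points) and partitions the speakers into non-overlapping groups with plain k-means. The function name `kmeans`, the speaker names, the toy embeddings, and the first-k-points initialization are all illustrative assumptions, not details taken from the paper.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Centroids start at the first k points,
    an assumption made here only to keep the sketch deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Hypothetical per-speaker embeddings (stand-ins for Siamese-network outputs).
speaker_embeddings = {
    "spk_a": (0.1, 0.2),
    "spk_b": (0.0, 0.3),
    "spk_c": (2.0, 2.1),
    "spk_d": (2.2, 1.9),
}

names = list(speaker_embeddings)
labels, centroids = kmeans([speaker_embeddings[n] for n in names], k=2)
groups = dict(zip(names, labels))  # speaker name -> group index
```

Each resulting group would then supply the training speakers for one specialist module, so the number of clusters `k` equals the number of specialists.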

Findings

01. Ensemble models outperform single high-capacity models in personalized speech enhancement.

02. Speaker grouping via Siamese network and clustering improves module selection accuracy.

03. Low-capacity specialist modules achieve better efficiency and adaptation than generalist models.

Abstract

This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined. To do this, we first train a…
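The sparsely active inference path described in the abstract can be sketched as follows: a gating step embeds the noisy input, picks one specialist, and only that specialist runs. The abstract excerpt does not say how the gate maps an embedding to a module, so nearest-centroid selection is an assumption here. `toy_embed`, `make_specialist`, `SparseEnsemble`, the centroid values, and the gain-based "denoisers" are hypothetical stand-ins for trained networks.

```python
import math

def toy_embed(signal):
    """Hypothetical embedding: (mean absolute amplitude, peak amplitude).
    A real gating module would be a trained speaker-embedding network."""
    mags = [abs(s) for s in signal]
    return (sum(mags) / len(mags), max(mags))

def make_specialist(gain):
    """Stand-in for a trained denoiser: simply scales the samples."""
    def denoise(signal):
        return [gain * s for s in signal]
    return denoise

class SparseEnsemble:
    """Runs exactly one specialist per input, chosen by the gate."""

    def __init__(self, embed_fn, centroids, specialists):
        assert len(centroids) == len(specialists)
        self.embed_fn = embed_fn
        self.centroids = centroids
        self.specialists = specialists

    def select(self, signal):
        # Gate: nearest group centroid in embedding space (an assumption).
        emb = self.embed_fn(signal)
        return min(
            range(len(self.centroids)),
            key=lambda j: math.dist(emb, self.centroids[j]),
        )

    def enhance(self, signal):
        idx = self.select(signal)
        # Only the selected module is evaluated; the others stay idle.
        return idx, self.specialists[idx](signal)

ensemble = SparseEnsemble(
    embed_fn=toy_embed,
    centroids=[(0.1, 0.2), (1.0, 2.0)],  # hypothetical group centroids
    specialists=[make_specialist(0.5), make_specialist(0.9)],
)

chosen, enhanced = ensemble.enhance([0.1, -0.2, 0.1])
```

Because no speaker-specific data is collected or used for fine-tuning at test time, the only per-utterance overhead in this sketch is the embedding and the distance comparison, which is the sense in which the selection is zero-shot.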

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies

Methods: Siamese Network · k-Means Clustering