Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection
Aswin Sivaraman, Minje Kim

TL;DR
This paper introduces a zero-shot personalized speech enhancement method built on a sparsely active, speaker-informed ensemble: a gating module estimates the test-time speaker's characteristics and selects the most appropriate specialist denoising module, so the system adapts to unseen speakers without collecting data from them.
Contribution
It proposes a novel ensemble approach with a gating mechanism and speaker grouping via Siamese networks and clustering, enabling zero-shot adaptation without test-time data collection.
Findings
Ensemble models outperform single high-capacity models in personalized speech enhancement.
Speaker grouping via Siamese network and clustering improves module selection accuracy.
Low-capacity specialist modules achieve better efficiency and adaptation than generalist models.
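The speaker-grouping idea in the findings above can be illustrated with a minimal sketch. It assumes each training speaker is already summarized by a fixed-length embedding vector (for example, from a Siamese network), and it uses plain Lloyd-style k-means written in NumPy. The function name `group_speakers`, the embedding dimensionality, and the iteration count are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

def group_speakers(embeddings, k, n_iters=50, seed=0):
    """Partition speaker embeddings into k non-overlapping groups (Lloyd's k-means).

    embeddings: array of shape (n_speakers, dim), one row per training speaker.
    Returns (labels, centroids). This is an illustrative sketch, not the paper's code.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct speakers (assumed initialization scheme).
    init_idx = rng.choice(len(embeddings), size=k, replace=False)
    centroids = embeddings[init_idx].astype(float).copy()

    for _ in range(n_iters):
        # Distance from every speaker to every centroid: shape (n_speakers, k).
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members) > 0:  # keep the old centroid if a group is empty
                centroids[j] = members.mean(axis=0)

    # Recompute labels so they agree with the final centroids.
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids
```

Each resulting group would then supply the training speakers for one specialist module; the centroids can be kept for routing unseen speakers at test time.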
Abstract
This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined. To do this, we first train a…
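The gating step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the gating module is reduced to nearest-centroid routing over a speaker embedding, and `embed_fn`, `centroids`, and `specialists` are hypothetical placeholders standing in for the learned embedding network, the per-group centroids, and the specialist denoisers. The paper's actual gating module may differ.

```python
import numpy as np

def select_specialist(embedding, centroids):
    """Return the index of the speaker group whose centroid is closest to the embedding."""
    dists = np.linalg.norm(centroids - embedding, axis=1)
    return int(dists.argmin())

def enhance(noisy, embed_fn, centroids, specialists):
    """Route a noisy signal to exactly one specialist (sparse activation).

    embed_fn:    callable mapping a signal to a speaker embedding (assumed).
    centroids:   array of shape (k, dim), one row per speaker group (assumed).
    specialists: list of k callables, each a stand-in for a denoising module.
    """
    z = embed_fn(noisy)                   # cheap estimate of speaker characteristics
    k = select_specialist(z, centroids)   # pick one module; the others never run
    return specialists[k](noisy), k

# Toy usage with stand-in components (not real denoisers).
toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_embed = lambda x: np.array([x.mean(), x.mean()])
toy_specialists = [lambda x: 0.5 * x, lambda x: 2.0 * x]
output, chosen = enhance(np.full(4, 9.0), toy_embed, toy_centroids, toy_specialists)
```

Because only the selected module is evaluated, inference cost is that of the gating step plus a single low-capacity specialist, regardless of how many specialists the ensemble contains.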
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies
Methods: Siamese Network · k-Means Clustering
