Unified Architecture and Unsupervised Speech Disentanglement for Speaker Embedding-Free Enrollment in Personalized Speech Enhancement

Ziling Huang; Haixin Guan; Yanhua Long

arXiv:2505.12288 · eess.AS · May 20, 2025

TL;DR

This paper introduces unified models for speech enhancement and personalized speech enhancement that incorporate unsupervised speech disentanglement, improving robustness and performance across varying enrollment speech conditions.

Contribution

The paper proposes two novel models, USEF-PNet and DSEF-PNet, unifying SE and PSE tasks and employing unsupervised speech disentanglement to enhance robustness against enrollment speech variations.

Findings

01 The proposed models outperform previous methods on the Libri2Mix and VoiceBank DEMAND datasets.

02 Unsupervised speech disentanglement reduces interference from the emotion and content of the enrollment speech.

03 Long-short enrollment pairing improves performance regardless of enrollment duration.

Abstract

Conventional speech enhancement (SE) aims to improve speech perception and intelligibility by suppressing noise without requiring enrollment speech as reference, whereas personalized SE (PSE) addresses the cocktail party problem by extracting a target speaker's speech using enrollment speech. While these two tasks tackle different yet complementary challenges in speech signal processing, they often share similar model architectures, with PSE incorporating an additional branch to process enrollment speech. This suggests developing a unified model capable of efficiently handling both SE and PSE tasks, thereby simplifying deployment while maintaining high performance. However, PSE performance is sensitive to variations in enrollment speech, like emotional tone, which limits robustness in real-world applications. To address these challenges, we propose two novel models, USEF-PNet and…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis