SEF-PNet: Speaker Encoder-Free Personalized Speech Enhancement with Local and Global Contexts Aggregation
Ziling Huang, Haixin Guan, Haoran Wei, Yanhua Long

TL;DR
SEF-PNet introduces a speaker encoder-free approach for personalized speech enhancement, utilizing local-global context aggregation and interactive speaker adaptation to improve performance without relying on pre-trained speaker models.
Contribution
The paper proposes SEF-PNet, a novel personalized speech enhancement (PSE) model that eliminates the need for a speaker encoder. It works directly on the enrollment speech and the noisy mixture, using two components: Interactive Speaker Adaptation (ISA) and Local-Global Context Aggregation (LCA).
Findings
Outperforms baseline models on the Libri2Mix dataset
Achieves state-of-the-art PSE performance
Effectively utilizes enrollment speech without speaker encoders
Abstract
Personalized speech enhancement (PSE) methods typically rely on pre-trained speaker verification models or self-designed speaker encoders to extract target speaker clues, guiding the PSE model in isolating the desired speech. However, these approaches suffer from significant model complexity and often underutilize enrollment speaker information, limiting the potential performance of the PSE model. To address these limitations, we propose a novel Speaker Encoder-Free PSE network, termed SEF-PNet, which fully exploits the information present in both the enrollment speech and noisy mixtures. SEF-PNet incorporates two key innovations: Interactive Speaker Adaptation (ISA) and Local-Global Context Aggregation (LCA). ISA dynamically modulates the interactions between enrollment and noisy signals to enhance the speaker adaptation, while LCA employs advanced channel attention within the PSE…
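The abstract says LCA uses channel attention, but the text is cut off before it gives any detail. To illustrate the general idea only, here is a minimal sketch of a generic squeeze-and-excitation style channel attention in plain Python. It is not the paper's LCA module: the function names (`channel_attention`, `matvec`), the weight shapes, and the toy input are all assumptions made for this example.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_attention(features, w_reduce, w_expand):
    """Generic squeeze-and-excitation channel attention (illustrative only).

    features: list of C channels, each a T x F list of lists
              (for example, time frames by frequency bins).
    w_reduce: H x C weight matrix (hypothetical bottleneck layer).
    w_expand: C x H weight matrix (hypothetical expansion layer).
    Returns the rescaled features and the per-channel gates.
    """
    # Squeeze: summarize each channel by its global average.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in features
    ]
    # Excite: bottleneck with ReLU, then expand and squash with a sigmoid.
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]
    gates = [sigmoid(g) for g in matvec(w_expand, hidden)]
    # Rescale: multiply every element of a channel by that channel's gate.
    scaled = [
        [[g * x for x in row] for row in ch]
        for ch, g in zip(features, gates)
    ]
    return scaled, gates


# Toy input: two channels of shape 2 x 2 (made-up values).
features = [
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
w_reduce = [[1.0, -1.0]]    # 1 hidden unit, 2 channels
w_expand = [[1.0], [-1.0]]  # 2 channels, 1 hidden unit

scaled, gates = channel_attention(features, w_reduce, w_expand)
print(gates)
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the first is emphasized relative to the second. In a trained network the weights would be learned, and the paper's actual design may differ substantially from this sketch.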
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Softmax · Attention Is All You Need
