UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition
Chong-Xin Gan, Peter Bell, Man-Wai Mak, Zhe Li, Zezhong Jin, Zilong Huang, Kong Aik Lee

TL;DR
This paper introduces UF-EMA, a U-Net fusion framework with EMA adaptation, enhancing noise-robust speaker recognition by effectively leveraging noisy and enhanced speech inputs.
Contribution
It proposes a scalable U-Net fusion framework combined with an EMA strategy, improving robustness and generalization in noisy speaker recognition tasks.
Findings
Outperforms existing methods on multiple noisy test sets.
Effectively exploits noisy and enhanced speech as multi-channel input.
EMA strategy mitigates overfitting of the speaker encoder.
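The multi-channel input mentioned in the findings can be pictured with a minimal sketch. The summary does not give the feature type, the shapes, or how the U-Net consumes the channels, so everything here is an assumption for illustration: the function name `stack_channels` is hypothetical, features are plain lists of frames, and the noisy and enhanced features are assumed to be time-aligned with equal shapes.

```python
from typing import List

# Hypothetical feature layout: a list of frames, each a list of bin values.
Features = List[List[float]]


def stack_channels(noisy: Features, enhanced: Features) -> List[Features]:
    """Stack noisy and enhanced features into a (2, frames, bins) input.

    Assumption: both inputs are time-aligned and share the same shape.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same number of frames")
    for noisy_row, enhanced_row in zip(noisy, enhanced):
        if len(noisy_row) != len(enhanced_row):
            raise ValueError("noisy and enhanced must have the same number of bins")
    # Copy each row so the stacked input does not alias the caller's lists.
    return [[row[:] for row in noisy], [row[:] for row in enhanced]]


# Toy values, not real speech features.
noisy = [[0.9, 0.4], [0.7, 0.5], [0.8, 0.6]]
enhanced = [[0.8, 0.1], [0.6, 0.2], [0.7, 0.3]]

x = stack_channels(noisy, enhanced)
shape = (len(x), len(x[0]), len(x[0][0]))  # (channels, frames, bins)
```

Keeping the noisy signal as its own channel means that any speaker cue the enhancement step removed is still available to the downstream encoder.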
Abstract
The joint training of speech enhancement and speaker embedding networks for speaker recognition is widely adopted in noisy acoustic environments. While effective, this paradigm often fails to leverage the generalization and robustness benefits inherent in large-scale speech enhancement pre-training. Moreover, preserving the speaker information in the denoised speech is not an explicit objective of the speech enhancement process. To address these limitations, we propose a scalable UNet-based Fusion framework (UF-EMA) that treats the noisy and enhanced speech as a multi-channel input, thereby enabling the speaker encoder to exploit speaker information effectively. In addition, an Exponential Moving Average strategy is applied to a speaker encoder pre-trained on clean speech to mitigate overfitting and facilitate a smooth transition from…
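For readers unfamiliar with exponential moving averages over model parameters, the generic update can be sketched as follows. This is the textbook EMA rule, not necessarily the paper's exact formulation: the abstract excerpt is truncated and does not state the decay value, the update schedule, or which copy of the encoder is averaged. The function name `ema_update`, the parameter dictionary layout, and the decay of 0.5 are assumptions chosen to keep the arithmetic easy to follow.

```python
from typing import Dict

# Hypothetical layout: parameter name -> scalar value.
Params = Dict[str, float]


def ema_update(ema: Params, current: Params, decay: float = 0.999) -> Params:
    """Return decay * ema + (1 - decay) * current for every parameter."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if ema.keys() != current.keys():
        raise ValueError("parameter names must match")
    return {
        name: decay * ema[name] + (1.0 - decay) * current[name]
        for name in ema
    }


# Assumption: the averaged copy starts from the encoder pre-trained on clean speech.
pretrained = {"w": 1.0, "b": 0.0}
ema_params = dict(pretrained)

# Toy "fine-tuned" parameters after one training step on noisy data.
finetuned = {"w": 2.0, "b": 1.0}

ema_params = ema_update(ema_params, finetuned, decay=0.5)
```

With a decay close to 1, the averaged parameters move only slightly per step, which is the usual reason EMA is associated with smoother adaptation and reduced overfitting.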
