ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise Extraction

Minu Kim; Kangwook Jang; Hoirin Kim

arXiv:2508.07219·eess.AS·August 12, 2025

TL;DR

ParaNoise-SV introduces a dual U-Net architecture in which one network explicitly models noise while the other enhances speech, lowering speaker verification error rates in noisy conditions.

Contribution

The paper presents a parallel joint learning framework with explicit noise modeling, aimed at improving noise robustness in speaker verification.

Findings

01

Achieves an 8.4% lower equal error rate (EER) than previous models

02

Effectively separates noise from speech during training

03

Enhances speaker verification accuracy in noisy environments
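
The findings above are stated in terms of EER. As a reminder of what that metric measures, here is a minimal sketch of one common way to estimate it from verification trial scores: sweep a decision threshold and take the point where the false acceptance rate and false rejection rate are closest. The function name `equal_error_rate` and the toy scores are illustrative assumptions, not taken from the paper, and the paper's own evaluation code may differ (for example, by interpolating between thresholds).

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping thresholds over the observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the threshold where the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None  # (gap between FAR and FRR, EER estimate, threshold)
    for t in thresholds:
        # FAR: fraction of impostor (non-target) trials wrongly accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # FRR: fraction of genuine (target) trials wrongly rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores for illustration only (not results from the paper).
target = [0.9, 0.8, 0.7, 0.4]      # same-speaker trials
nontarget = [0.6, 0.3, 0.2, 0.1]   # different-speaker trials
eer, threshold = equal_error_rate(target, nontarget)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER means fewer errors of both kinds at the balanced operating point, which is why it is a common single-number summary for speaker verification systems.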

Abstract

Noise-robust speaker verification leverages joint learning of speech enhancement (SE) and speaker verification (SV) to improve robustness. However, prevailing approaches rely on implicit noise suppression, which struggles to separate noise from speaker characteristics as they do not explicitly distinguish noise from speech during training. Although integrating SE and SV helps, it remains limited in handling noise effectively. Meanwhile, recent SE studies suggest that explicitly modeling noise, rather than merely suppressing it, enhances noise resilience. Reflecting this, we propose ParaNoise-SV, with dual U-Nets combining a noise extraction (NE) network and a speech enhancement (SE) network. The NE U-Net explicitly models noise, while the SE U-Net refines speech with guidance from NE through parallel connections, preserving speaker-relevant features. Experimental results show that…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis