Towards Robust Speaker Verification with Target Speaker Enhancement
Chunlei Zhang, Meng Yu, Chao Weng, Dong Yu

TL;DR
This paper introduces TASE-SVNet, a neural network that combines target speaker enhancement and embedding extraction to improve robustness in speaker verification, especially in noisy and overlapped speech scenarios.
Contribution
The paper presents a neural model with four components: a speaker-conditioned enhancement front-end, a nontarget speaker sampling strategy, teacher-student training for a lightweight embedding extractor, and iterative inference for noisy enrollment.
Findings
Significant EER reduction over baselines in overlapped speech scenarios.
Effective nontarget speaker suppression improves verification accuracy.
Iterative inference enhances robustness in noisy environments.
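The findings above are reported in terms of equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of how that metric can be computed from trial scores (the function name `equal_error_rate` and the toy scores are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER for a set of verification trials.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for nontarget trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    target = scores[labels == 1]
    nontarget = scores[labels == 0]

    # Every distinct score is a candidate threshold; a trial is accepted
    # when its score is greater than or equal to the threshold.
    thresholds = np.unique(scores)
    far = np.array([(nontarget >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(target < t).mean() for t in thresholds])      # false rejects

    # Take the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)

# Toy example: one target trial scores below one nontarget trial.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

This threshold sweep is a simple approximation; evaluation toolkits often interpolate between thresholds, so values on small trial lists may differ slightly.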
Abstract
This paper proposes the target speaker enhancement based speaker verification network (TASE-SVNet), an all-neural model that couples target speaker enhancement and speaker embedding extraction for robust speaker verification (SV). Specifically, a speech enhancement module conditioned on the enrollment speaker is employed as the front-end to extract the target speaker's speech from a mixture with interfering speakers and environmental noise. Compared with conventional target speaker enhancement models, SV requires additional attention to suppressing nontarget speakers and interference. Therefore, an effective nontarget speaker sampling strategy is explored. To improve speaker embedding extraction with a lightweight model, teacher-student (T/S) training is proposed to distill speaker-discriminative information from large models to small models. Iterative inference is investigated to address…
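The abstract does not specify the form of the T/S objective. As a hedged illustration of the general idea only, one common way to distill a large embedding model into a small one is to pull the student's embeddings toward the teacher's with a cosine-distance loss. The function name `cosine_distillation_loss`, the embedding shapes, and the choice of cosine distance are all assumptions for this sketch, not the paper's method:

```python
import numpy as np

def cosine_distillation_loss(student_emb, teacher_emb, eps=1e-8):
    """Mean cosine distance between student and teacher embeddings.

    Both inputs have shape (batch, dim). This embedding-level loss is an
    assumed stand-in; the paper's actual T/S objective is not given here.
    """
    student_emb = np.asarray(student_emb, dtype=float)
    teacher_emb = np.asarray(teacher_emb, dtype=float)

    # L2-normalize each embedding so the dot product is a cosine similarity.
    s = student_emb / (np.linalg.norm(student_emb, axis=1, keepdims=True) + eps)
    t = teacher_emb / (np.linalg.norm(teacher_emb, axis=1, keepdims=True) + eps)

    cos = np.sum(s * t, axis=1)
    # 0 when the student matches the teacher's direction, 2 when opposite.
    return float(np.mean(1.0 - cos))

# Hypothetical embeddings: the student is a noisy copy of the teacher.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = teacher + 0.1 * rng.normal(size=(4, 8))
print(cosine_distillation_loss(student, teacher))
```

In a real training loop the teacher would be frozen and this term would be minimized with respect to the student's parameters, typically alongside a speaker classification loss.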
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
