TL;DR
This paper introduces a feature enhancement method for speaker verification that is trained with deep feature losses: the enhancement network is optimized in the hidden activation space of a pre-trained speaker embedding model, which improves robustness in noisy simulated and real-world conditions.
Contribution
It proposes a feature-domain supervised denoising approach trained with a deep feature loss, which the summary presents as a novel application of deep feature losses to speaker verification, and shows that it improves verification performance in adverse conditions.
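As a rough illustration of what a deep feature loss computes, the sketch below compares an enhanced signal and a clean signal through the hidden activations of a frozen auxiliary network rather than directly in feature space. Everything specific here is an assumption for illustration: the two-layer ReLU network with random weights stands in for the pre-trained speaker embedding model, the 40-dimensional input is a hypothetical feature size, and the mean absolute (L1) distance summed over layers is one plausible choice rather than the paper's confirmed formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the frozen, pre-trained speaker embedding network:
# two ReLU layers with fixed random weights (not the paper's architecture).
AUX_WEIGHTS = [
    rng.standard_normal((40, 64)) * 0.1,
    rng.standard_normal((64, 32)) * 0.1,
]


def hidden_activations(features):
    """Return the activations of every hidden layer for a (frames, 40) input."""
    activations = []
    h = features
    for w in AUX_WEIGHTS:
        h = np.maximum(h @ w, 0.0)  # linear layer followed by ReLU
        activations.append(h)
    return activations


def deep_feature_loss(enhanced, clean):
    """Sum over layers of the mean absolute difference between activations.

    The L1 distance is an assumption made for this sketch.
    """
    acts_enhanced = hidden_activations(enhanced)
    acts_clean = hidden_activations(clean)
    return sum(np.mean(np.abs(e - c)) for e, c in zip(acts_enhanced, acts_clean))


# Synthetic features: "noisy" plays the role of an imperfect enhancer output.
clean = rng.standard_normal((100, 40))
noisy = clean + 0.5 * rng.standard_normal((100, 40))

loss_noisy = deep_feature_loss(noisy, clean)  # positive
loss_clean = deep_feature_loss(clean, clean)  # 0.0, identical inputs
print(loss_noisy, loss_clean)
```

In training, the auxiliary weights would stay fixed and only the enhancement network would be updated to reduce this loss; the sketch computes the loss value only.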
Findings
Consistent gains in every tested condition over the state-of-the-art augmented Factorized-TDNN x-vector system.
10.38% relative reduction in minDCF on the BabyTrain corpus.
12.40% relative reduction in EER on the BabyTrain corpus.
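For readers unfamiliar with the term, a relative reduction expresses the improvement as a fraction of the baseline value rather than as an absolute difference. A minimal sketch follows; the function name and the input values are hypothetical and are not results from the paper.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error metric, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical EER values in percent, chosen only to illustrate the arithmetic.
print(relative_reduction(8.0, 7.0))  # → 12.5
```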
Abstract
Speaker verification still suffers from the challenge of generalization to novel adverse environments. We leverage the recent advancements made by deep learning based speech enhancement and propose a feature-domain supervised denoising based solution. We propose to use Deep Feature Loss, which optimizes the enhancement network in the hidden activation space of a pre-trained auxiliary speaker embedding network. We experimentally verify the approach on simulated and real data. A simulated testing setup is created using various noise types at different SNR levels. For evaluation on real data, we choose the BabyTrain corpus, which consists of children's recordings in uncontrolled environments. We observe consistent gains in every condition over the state-of-the-art augmented Factorized-TDNN x-vector system. On the BabyTrain corpus, we observe relative gains of 10.38% and 12.40% in minDCF and EER…
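The abstract mentions a simulated test setup built from noise at different SNR levels. A common way to build such a mixture is to scale the noise so that the speech-to-noise power ratio matches a target value in decibels. The sketch below shows that standard recipe; it is not taken from the paper, and the function name, the synthetic signals, and the 5 dB target are assumptions for illustration.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(speech_power / scaled_noise_power)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # synthetic stand-in for a speech signal
noise = rng.standard_normal(16000)   # synthetic stand-in for a noise recording

noisy = mix_at_snr(speech, noise, snr_db=5.0)

# Verify the achieved SNR from the mixture itself.
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
print(round(float(achieved), 3))
```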
