A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

Youngmoon Jung; Yeunju Choi; Hyungjun Lim; Hoirin Kim

arXiv:2010.02477·eess.AS·October 7, 2020


TL;DR

This paper presents a unified deep learning framework that makes speaker verification more robust in adverse environments. It integrates feature pyramid module (FPM)-based multi-scale aggregation, self-adaptive soft VAD, and speech enhancement to handle short speech segments, noise, reverberation, and long non-speech segments.

Contribution

The paper introduces an end-to-end deep learning framework that combines FPM-based multi-scale aggregation (MSA), self-adaptive soft VAD (SAS-VAD), and speech enhancement. It is described as the first work to unify these three models for robust speaker verification.

Findings

01 Outperforms baseline i-vector and deep speaker embedding systems.

02 Effective in noisy and reverberant conditions on Korean indoor and VoxCeleb datasets.

03 Demonstrates robustness to short speech segments and long non-speech segments.

Abstract

Speaker verification (SV) has recently attracted considerable research interest due to the growing popularity of virtual assistants. At the same time, SV systems are increasingly required to be robust to short speech segments, especially in noisy and reverberant environments. In this paper, we consider one more important requirement for practical applications: the system should be robust to an audio stream containing long non-speech segments when voice activity detection (VAD) is not applied. To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD). We present the FPM-based MSA to deal with short speech segments in noisy and reverberant environments, and we use the SAS-VAD to increase robustness to long non-speech segments. To further improve the robustness to…
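
To make the long non-speech problem concrete, the sketch below shows a generic form of soft VAD weighting: each frame-level feature is weighted by a speech posterior before averaging, so non-speech frames contribute little to the pooled vector. This is an illustration only and is not taken from the paper. The function name `soft_vad_mean_pool`, the toy features, and the posterior values are all hypothetical, and the paper's SAS-VAD may work quite differently.

```python
from typing import List, Sequence


def soft_vad_mean_pool(
    frames: Sequence[Sequence[float]],
    speech_posteriors: Sequence[float],
    eps: float = 1e-8,
) -> List[float]:
    """Average frame-level features, weighting each frame by its speech posterior.

    Hypothetical helper for illustration; not the paper's SAS-VAD.
    """
    if len(frames) != len(speech_posteriors):
        raise ValueError("need one speech posterior per frame")
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    # eps guards against division by zero when every posterior is 0.
    total_weight = sum(speech_posteriors) + eps
    pooled = [0.0] * dim
    for frame, weight in zip(frames, speech_posteriors):
        for d in range(dim):
            pooled[d] += weight * frame[d]
    return [value / total_weight for value in pooled]


# Toy stream: two speech-like frames followed by a long run of non-speech frames.
speech = [[1.0, 2.0], [3.0, 4.0]]
non_speech = [[0.0, 0.0]] * 8
frames = speech + non_speech
posteriors = [0.9, 0.9] + [0.0] * 8  # assumed VAD outputs, made up for this example

# Uniform weights: the non-speech frames drag the average toward zero.
uniform = soft_vad_mean_pool(frames, [1.0] * len(frames))
# Soft VAD weights: the average stays close to the mean of the speech frames.
weighted = soft_vad_mean_pool(frames, posteriors)
print("uniform pooling: ", uniform)
print("soft VAD pooling:", weighted)
```

With uniform weights the eight non-speech frames dominate the average, while the posterior-weighted version stays near the mean of the two speech frames. Using soft weights instead of a hard speech/non-speech decision keeps the operation differentiable, which is one reason such weighting fits into an end-to-end trained network.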
