Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification
Youngmoon Jung, Yeunju Choi, Hoirin Kim

TL;DR
This paper introduces self-adaptive soft voice activity detection (VAD), which incorporates a DNN-based VAD into a deep speaker embedding system and fine-tunes that VAD on the speaker verification data through unsupervised domain adaptation, improving robustness in real-world environments.
Contribution
It proposes a novel self-adaptive soft VAD approach combining soft feature selection with unsupervised domain adaptation for enhanced speaker verification.
Findings
Significant performance improvement in real-world environments.
Effective unsupervised domain adaptation of the VAD with two schemes: speech posterior-based and joint learning.
Enhanced robustness of speaker verification systems.
Abstract
Voice activity detection (VAD), which classifies frames as speech or non-speech, is an important module in many speech applications including speaker verification. In this paper, we propose a novel method, called self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD into a deep speaker embedding system. The proposed method is a combination of the following two approaches. The first approach is soft VAD, which performs a soft selection of frame-level features extracted from a speaker feature extractor. The frame-level features are weighted by their corresponding speech posteriors estimated from the DNN-based VAD, and then aggregated to generate a speaker embedding. The second approach is self-adaptive VAD, which fine-tunes the pre-trained VAD on the speaker verification data to reduce the domain mismatch. Here, we introduce two unsupervised domain adaptation (DA)…
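The soft VAD step described in the abstract (frame-level features weighted by their speech posteriors, then aggregated) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `soft_vad_pool`, the array shapes, and the choice of a normalized weighted mean as the aggregation are assumptions made for illustration, since the abstract does not specify the pooling formula.

```python
import numpy as np


def soft_vad_pool(features, speech_posteriors, eps=1e-8):
    """Aggregate frame-level features into one vector, weighting each frame
    by its speech posterior (hypothetical helper, not from the paper).

    features:          array of shape (T, D), one D-dim feature per frame
    speech_posteriors: array of shape (T,), values in [0, 1] from a VAD
    """
    features = np.asarray(features, dtype=np.float64)
    weights = np.asarray(speech_posteriors, dtype=np.float64)
    if features.ndim != 2 or weights.shape != (features.shape[0],):
        raise ValueError("expected features (T, D) and posteriors (T,)")

    # Soft selection: scale every frame by its speech probability.
    weighted_sum = (weights[:, None] * features).sum(axis=0)

    # Assumed aggregation: weighted mean, so frames with posterior near 0
    # contribute almost nothing; eps guards against an all-zero weight sum.
    return weighted_sum / (weights.sum() + eps)


# Two speech-like frames and one non-speech frame with a large outlier value.
frames = np.array([[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]])
posteriors = np.array([1.0, 1.0, 0.0])
embedding = soft_vad_pool(frames, posteriors)
```

With hard VAD the third frame would be either kept or dropped outright. Here its influence scales continuously with its posterior, so a frame the VAD is unsure about is down-weighted rather than discarded.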
