SVVAD: Personal Voice Activity Detection for Speaker Verification

Zuheng Kang; Jianzong Wang; Junqing Peng; Jing Xiao

arXiv:2305.19581 · cs.SD · June 1, 2023 · 1 citation

Open Access

TL;DR

This paper introduces SVVAD, a novel voice activity detection framework tailored for speaker verification that adapts speech features and uses label-free training to improve accuracy in noisy and multi-speaker environments.

Contribution

The paper presents a speaker verification-based VAD method with a label-free training approach using triplet-like losses, enhancing robustness without relying on inaccurate labels.
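To make the idea of a triplet-style objective concrete, the following is a minimal sketch of a generic margin-based triplet loss on speaker embeddings using cosine similarity. It is not the paper's loss: the function names, the `margin` value, and the choice of cosine similarity are all assumptions made for illustration, since this page says only that the losses are "triplet-like".

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet loss (hypothetical helper, not the paper's formulation).

    anchor, positive: embeddings assumed to come from the same speaker.
    negative: embedding assumed to come from a different speaker.
    The loss is zero once the anchor is more similar to the positive
    than to the negative by at least `margin`.
    """
    pos_sim = cosine_similarity(anchor, positive)
    neg_sim = cosine_similarity(anchor, negative)
    return max(0.0, margin - pos_sim + neg_sim)

# Toy 2-D embeddings: the anchor matches the positive and is orthogonal to the negative.
anchor = [1.0, 0.0]
same_speaker = [1.0, 0.0]
other_speaker = [0.0, 1.0]
well_separated = triplet_margin_loss(anchor, same_speaker, other_speaker)
badly_separated = triplet_margin_loss(anchor, other_speaker, same_speaker)
```

A loss of this shape needs only speaker identity to form triplets, not frame-level speech/non-speech labels, which is one way a training signal can come from the SV task itself.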

Findings

01

SVVAD significantly reduces equal error rate in mixed speaker scenarios.

02

Decision boundaries align with human judgments of speech importance.

03

Outperforms baseline methods in noisy and multi-speaker conditions.

Abstract

Voice activity detection (VAD) improves the performance of speaker verification (SV) by preserving speech segments and attenuating the effects of non-speech. However, this scheme is not ideal: (1) it fails in noisy environments or multi-speaker conversations; (2) it is trained on inaccurate labels that are not sensitive to SV. To address this, we propose a speaker verification-based voice activity detection (SVVAD) framework that adapts the speech features according to which of them are most informative for SV. To achieve this, we introduce a label-free training method with triplet-like losses that completely avoids the SV performance degradation caused by incorrect labeling. Extensive experiments show that SVVAD significantly outperforms the baseline in terms of equal error rate (EER) under conditions where other speakers are mixed at different ratios. Moreover, the decision boundaries reveal the…
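For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate (non-target trials accepted) equals the false rejection rate (target trials rejected). The sketch below estimates it from two lists of verification scores by sweeping thresholds over the observed scores; the function name and the toy scores are assumptions for illustration and are not taken from the paper's experiments.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate EER from verification scores (higher score = more likely same speaker).

    A trial is accepted when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.unique(np.concatenate([target, nontarget]))

    # False rejection rate: fraction of target trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: fraction of non-target trials at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)

# Toy scores: one target trial scores low and one non-target trial scores high.
eer_overlap = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
# Perfectly separated scores give an EER of zero.
eer_clean = equal_error_rate([0.9, 0.8], [0.1, 0.2])
```

A lower EER means fewer errors of both kinds at the balanced threshold, so the reported improvement corresponds to better separation between target and non-target scores when other speakers are mixed in.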

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems