VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model

Yayun He; Zuheng Kang; Jianzong Wang; Junqing Peng; Jing Xiao

arXiv:2310.04681 · cs.SD · October 10, 2023

Open Access

TL;DR

VoiceExtender uses speaker-embedding-guided diffusion models to augment the speech features of short utterances, cutting the equal error rate (EER) on VoxCeleb1 by up to 46.1% relative to the baseline for 0.5-second utterances.

Contribution

The paper proposes VoiceExtender, an architecture built on two guided diffusion models, one with built-in and one with external speaker embedding (SE) guidance, that augment speech features from a short utterance to improve speaker verification.

Findings

01

Achieves up to 46.1% relative EER reduction on VoxCeleb1 for 0.5s utterances.

02

Outperforms the baseline at every tested utterance length, with relative EER improvements of 35.7%, 10.4%, and 5.7% at 1.0, 1.5, and 2.0 seconds.

03

Demonstrates effectiveness of diffusion models in speech feature augmentation.

Abstract

Speaker verification (SV) performance deteriorates as utterances become shorter. To this end, we propose a new architecture called VoiceExtender which provides a promising solution for improving SV performance when handling short-duration speech signals. We use two guided diffusion models, the built-in and the external speaker embedding (SE) guided diffusion model, both of which utilize a diffusion model-based sample generator that leverages SE guidance to augment the speech features based on a short utterance. Extensive experimental results on the VoxCeleb1 dataset show that our method outperforms the baseline, with relative improvements in equal error rate (EER) of 46.1%, 35.7%, 10.4%, and 5.7% for the short utterance conditions of 0.5, 1.0, 1.5, and 2.0 seconds, respectively.
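The abstract reports results as relative improvements in EER. As a reading aid, here is a minimal sketch of how an EER and a relative improvement are commonly computed from verification trial scores. It is not taken from the paper: the function names `equal_error_rate` and `relative_improvement`, the threshold sweep, and all numbers in the usage note are illustrative assumptions.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER of a verification system (illustrative helper).

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for a same-speaker (target) trial, 0 for a different-speaker trial.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    target = scores[labels == 1]
    nontarget = scores[labels == 0]

    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.unique(scores)
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # The EER is where the two error rates meet; take the closest point.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent (illustrative helper)."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer
```

With made-up inputs, `equal_error_rate([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1], [1, 1, 1, 0, 1, 0, 0, 0])` returns 0.25, and `relative_improvement(20.0, 15.0)` returns 25.0. The paper's own evaluation code may interpolate between thresholds or use a different convention, so treat this only as an approximation of the metric.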

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion