Text-Independent Speaker Verification Based on Deep Neural Networks and Segmental Dynamic Time Warping
Mohamed Adel, Mohamed Afify, Akram Gaballah

TL;DR
This paper introduces a novel text-independent speaker verification method combining deep neural network-derived d-vectors with segmental dynamic time warping, outperforming traditional i-vector and d-vector approaches on NIST 2008 data.
Contribution
The paper proposes using frame-level d-vectors as features for segmental dynamic time warping, and reports that the resulting verification scores outperform both an i-vector/PLDA baseline and a d-vector baseline on NIST 2008 data.
Findings
Outperforms i-vector baseline with PLDA scores
Surpasses d-vector approach with local cosine and PLDA distances
Score fusion with the i-vector/PLDA baseline yields significant gains over both methods
Abstract
In this paper we present a new method for text-independent speaker verification that combines segmental dynamic time warping (SDTW) and the d-vector approach. The d-vectors, generated from a feed-forward deep neural network trained to distinguish between speakers, are used as features to perform alignment and hence calculate the overall distance between the enrolment and test utterances. We present results on the NIST 2008 data set for speaker verification, where the proposed method outperforms the conventional i-vector baseline with PLDA scores and outperforms the d-vector approach with local distances based on cosine and PLDA scores. Also, score combination with the i-vector/PLDA baseline leads to significant gains over both methods.
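To make the alignment idea in the abstract concrete, here is a minimal sketch of how frame-level d-vectors could be compared with a segment-wise DTW and a cosine local distance. It is not the authors' implementation: the fixed-length segmentation of the test utterance, the free start and end points in the enrolment sequence, the normalisation by segment length, and the names `cosine_distance`, `subsequence_dtw`, `segmental_dtw_distance` and `seg_len` are all assumptions made for illustration. The d-vectors are taken as given (plain lists of floats), since the abstract does not describe the network that produces them.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat an all-zero vector as orthogonal
    return 1.0 - dot / (norm_u * norm_v)


def subsequence_dtw(segment, reference):
    """Align `segment` to the best-matching stretch of `reference`.

    The path may start and end at any reference frame (assumption).
    Returns the accumulated cost divided by the segment length.
    """
    if not segment or not reference:
        raise ValueError("both sequences must be non-empty")
    n, m = len(segment), len(reference)
    # Row 0: free start, so each cell is just the local distance.
    prev = [cosine_distance(segment[0], r) for r in reference]
    for i in range(1, n):
        cur = [0.0] * m
        for j in range(m):
            local = cosine_distance(segment[i], reference[j])
            best = prev[j]  # vertical step
            if j > 0:
                # diagonal and horizontal steps
                best = min(best, prev[j - 1], cur[j - 1])
            cur[j] = local + best
        prev = cur
    # Free end: take the cheapest endpoint in the last row.
    return min(prev) / n


def segmental_dtw_distance(enrol, test, seg_len=4):
    """Average subsequence-DTW distance over fixed-length test segments."""
    if not enrol or not test:
        raise ValueError("both utterances must be non-empty")
    segments = [test[k:k + seg_len] for k in range(0, len(test), seg_len)]
    scores = [subsequence_dtw(seg, enrol) for seg in segments]
    return sum(scores) / len(scores)


# Toy 2-dimensional "d-vectors"; real ones would come from the DNN.
enrol_dvecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
same_dvecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
other_dvecs = [[-1.0, 0.0], [0.0, -1.0]]

same_score = segmental_dtw_distance(enrol_dvecs, same_dvecs, seg_len=2)
other_score = segmental_dtw_distance(enrol_dvecs, other_dvecs, seg_len=2)
```

In this toy setup a lower distance means a closer match, so `same_score` should come out smaller than `other_score`. A verification decision would then compare the distance against a threshold tuned on development data, which the abstract does not specify.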
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Time Series Analysis and Forecasting
