Multi-Task Learning with High-Order Statistics for X-vector based Text-Independent Speaker Verification
Lanhua You, Wu Guo, Lirong Dai, Jun Du

TL;DR
This paper introduces a multi-task learning approach for x-vector speaker verification that incorporates high-order statistical reconstruction to enhance embedding robustness and discriminability, showing improved results on standard datasets.
Contribution
The paper proposes a novel multi-task training framework combining classification and statistical reconstruction to improve x-vector embeddings for speaker verification.
Findings
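The proposed method outperforms the original x-vector approach on the NIST SRE16 evaluation dataset and the VOiCES dataset.
It makes the speaker embeddings more discriminative and robust while adding very little complexity.
The results support reconstructing first- and higher-order statistics of the input utterance as an auxiliary task when training speaker embeddings.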
Abstract
X-vector based deep neural network (DNN) embedding systems have proven effective for text-independent speaker verification. This paper presents a multi-task learning architecture for training the speaker embedding DNN, with the primary task of classifying the target speakers and the auxiliary task of reconstructing the first- and higher-order statistics of the original input utterance. The proposed training strategy combines supervised and unsupervised learning in one framework to make the speaker embeddings more discriminative and robust. Experiments are carried out using the NIST SRE16 evaluation dataset and the VOiCES dataset. The results demonstrate that the proposed method outperforms the original x-vector approach with very little added complexity.
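The multi-task objective described in the abstract can be illustrated with a minimal sketch. The abstract does not say which higher-order statistics are reconstructed, which reconstruction loss is used, or how the two tasks are weighted, so everything specific here is an assumption: the targets are taken to be the per-dimension mean, standard deviation, skewness, and kurtosis of the input frames, the auxiliary loss is a mean squared error, and `aux_weight` is a hypothetical weighting factor. All function names are illustrative and not taken from the paper.

```python
import math


def utterance_statistics(frames, eps=1e-8):
    """Per-dimension mean, std, skewness, and kurtosis over one utterance.

    frames: list of equal-length feature vectors, one per frame.
    Returns a flat list laid out as [means..., stds..., skews..., kurts...].
    The choice of these four statistics is an assumption for illustration.
    """
    n = len(frames)
    dim = len(frames[0])
    means, stds, skews, kurts = [], [], [], []
    for d in range(dim):
        column = [frame[d] for frame in frames]
        mean = sum(column) / n
        centered = [x - mean for x in column]
        var = sum(c * c for c in centered) / n
        std = math.sqrt(var + eps)  # eps guards against zero variance
        # Standardized third and fourth central moments.
        skew = sum(c ** 3 for c in centered) / n / std ** 3
        kurt = sum(c ** 4 for c in centered) / n / std ** 4
        means.append(mean)
        stds.append(std)
        skews.append(skew)
        kurts.append(kurt)
    return means + stds + skews + kurts


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example (primary speaker-classification task)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[target_index]


def mean_squared_error(predicted, target):
    """Reconstruction error for the auxiliary task (assumed loss, not from the paper)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def multitask_loss(logits, speaker_index, reconstructed, target_stats, aux_weight=0.1):
    """Classification loss plus a weighted statistics-reconstruction loss.

    aux_weight is a hypothetical hyperparameter; the paper's value is not given here.
    """
    primary = cross_entropy(logits, speaker_index)
    auxiliary = mean_squared_error(reconstructed, target_stats)
    return primary + aux_weight * auxiliary
```

In a real system the `logits` and `reconstructed` values would come from two output heads of the same embedding network, and the targets from `utterance_statistics` need no speaker labels, which is what makes the auxiliary task unsupervised.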
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
