On deep speaker embeddings for text-independent speaker recognition
Sergey Novoselov, Andrey Shulipa, Ivan Kremnev, Alexandr Kozlov, Vadim Shchemelinin

TL;DR
This paper explores the use of deep neural networks with angular softmax activation for text-independent speaker recognition, demonstrating improved accuracy and robustness over traditional methods.
Contribution
It introduces a discriminative training approach with angular softmax and residual deep architectures, outperforming previous systems and standard backends like LDA-PLDA.
Findings
Angular softmax improves discriminability of speaker embeddings.
Deep residual networks outperform shallow architectures.
Discriminative metric learning surpasses LDA-PLDA in accuracy.
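
To make the first finding concrete, here is a minimal NumPy sketch of an angular softmax (A-Softmax) loss, following the commonly used formulation in which the target-class logit ‖x‖·cos(θ) is replaced by ‖x‖·ψ(θ), with ψ(θ) = (−1)^k·cos(mθ) − 2k for θ in [kπ/m, (k+1)π/m]. The function name `angular_softmax_loss`, the margin `m=4`, and the array shapes are assumptions made for illustration; this summary does not state the paper's exact settings.

```python
import numpy as np

def angular_softmax_loss(x, W, labels, m=4):
    """Mean A-Softmax loss (illustrative sketch, not the authors' code).

    x: (N, D) embeddings, W: (D, C) class weights, labels: (N,) int class ids.
    m: integer angular margin (assumed value; m=1 gives a plain softmax
       over weight-normalized logits).
    """
    # Normalize each class weight vector so logits depend only on angle and ||x||.
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)            # (N, 1)
    cos = np.clip((x @ W_unit) / x_norm, -1.0, 1.0)              # (N, C)

    idx = np.arange(len(labels))
    theta = np.arccos(cos[idx, labels])                          # target angles
    # k selects the interval [k*pi/m, (k+1)*pi/m] that contains theta.
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)
    sign = np.where(k % 2 == 0, 1.0, -1.0)
    psi = sign * np.cos(m * theta) - 2.0 * k                     # psi <= cos(theta)

    logits = x_norm * cos
    logits[idx, labels] = x_norm[:, 0] * psi                     # apply the margin

    # Numerically stable log-softmax followed by cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[idx, labels].mean())
```

Because ψ(θ) never exceeds cos(θ), the margin makes the target class harder to satisfy during training, which pushes embeddings of the same speaker into a tighter angular region.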
Abstract
We investigate deep neural network performance in the text-independent speaker recognition task. We demonstrate that using angular softmax activation at the last classification layer of a classification neural network, instead of a simple softmax activation, makes it possible to train a more generalized discriminative speaker embedding extractor. Cosine similarity is an effective metric for speaker verification in this embedding space. We also address the problem of choosing an architecture for the extractor. We found that deep networks with residual frame-level connections outperform wide but relatively shallow architectures. This paper also proposes several improvements to previous DNN-based extractor systems to increase speaker recognition accuracy. We show that the discriminatively trained similarity metric learning approach outperforms the standard LDA-PLDA method as an embedding backend.…
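
The abstract notes that cosine similarity works as a verification metric in this embedding space. A minimal sketch of such scoring follows; the function names `cosine_score` and `verify` and the threshold value are hypothetical, and in practice a threshold would be tuned on held-out trials rather than fixed in advance.

```python
import numpy as np

def cosine_score(enroll_emb, test_emb):
    """Cosine similarity between two speaker embeddings (1-D arrays)."""
    denom = np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb)
    return float(np.dot(enroll_emb, test_emb) / denom)

def verify(enroll_emb, test_emb, threshold=0.5):
    """Accept the trial if the score reaches the threshold.

    threshold=0.5 is an arbitrary placeholder, not a value from the paper.
    """
    return cosine_score(enroll_emb, test_emb) >= threshold
```

The score depends only on the angle between the two vectors, not their lengths, which matches an embedding space trained with an angular objective.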
Taxonomy
Methods: Softmax
