Back-ends Selection for Deep Speaker Embeddings
Zhuo Li, Runqiu Xiao, Zihan Zhang, Zhenduo Zhao, Wenchao Wang, Pengyuan Zhang

TL;DR
This paper systematically compares cosine similarity scoring and PLDA as back-ends for deep speaker embeddings. Its experiments indicate that cosine scoring performs better in same-domain scenarios, while PLDA is preferable in cross-domain ones.
Contribution
It provides a comprehensive analysis and practical guidelines for selecting back-ends for deep speaker embeddings in various domain conditions.
Findings
Cosine similarity outperforms PLDA in same-domain scenarios.
PLDA yields better results in cross-domain situations.
Experiments on the VoxCeleb and NIST SRE datasets support the authors' conjecture.
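For readers unfamiliar with the cosine back-end named in the findings, here is a minimal sketch of cosine similarity scoring between an enrollment embedding and a test embedding. It is an illustration only, not the authors' implementation: the function names `cosine_score` and `same_speaker` and the default `threshold` value are assumptions made for this example, and the embeddings are plain Python lists rather than the output of any particular model.

```python
import math


def cosine_score(enroll, test):
    """Return the cosine similarity of two embedding vectors.

    A higher score suggests the two utterances are more likely
    to come from the same speaker.
    """
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_enroll = math.sqrt(sum(a * a for a in enroll))
    norm_test = math.sqrt(sum(b * b for b in test))
    if norm_enroll == 0.0 or norm_test == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_enroll * norm_test)


def same_speaker(enroll, test, threshold=0.5):
    """Accept the trial if the cosine score reaches the threshold.

    The default threshold is a hypothetical placeholder; in practice
    it would be tuned on held-out development trials.
    """
    return cosine_score(enroll, test) >= threshold
```

Cosine scoring has no trainable parameters, so it needs no in-domain training data. PLDA, by contrast, must be fitted on labeled embeddings before it can score trials.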
Abstract
Probabilistic Linear Discriminant Analysis (PLDA) was the dominant and necessary back-end for early speaker recognition approaches, like i-vector and x-vector. However, with the development of neural networks and margin-based loss functions, we can obtain deep speaker embeddings (DSEs), which have advantages of increased inter-class separation and smaller intra-class distances. In this case, PLDA seems unnecessary or even counterproductive for the discriminative embeddings, and cosine similarity scoring (Cos) achieves better performance than PLDA in some situations. Motivated by this, in this paper, we systematically explore how to select back-ends (Cos or PLDA) for deep speaker embeddings to achieve better performance in different situations. By analyzing PLDA and the properties of DSEs extracted from models with different numbers of segment-level layers, we make the conjecture that…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
