On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification
Achintya kr. Sarkar, Zheng-Hua Tan

TL;DR
This paper systematically evaluates the impact of training targets, activation functions, and loss functions on deep neural network performance in text-dependent speaker verification, highlighting the effectiveness of self-supervised methods and GELU activation.
Contribution
It provides a comprehensive analysis of training targets, loss functions, and activation functions, and demonstrates the effectiveness of GELU and TCL in text-dependent speaker verification (TD-SV) systems.
Findings
GELU activation reduces error rates significantly.
Time-contrastive learning (TCL) outperforms the other training targets studied (speaker identity and auto-regressive prediction coding).
Cross entropy, joint-softmax, and focal loss improve system accuracy.
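For readers unfamiliar with two of the losses named above, the following is a minimal sketch of softmax cross entropy and the standard focal loss, which scales cross entropy by (1 - p_t)^gamma so that well-classified examples contribute less. The function names, the pure-Python setup, and the default gamma = 2.0 are illustrative assumptions; the exact formulations and hyperparameters used in the paper are not given in this summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross entropy for one example: -log p_t, with p_t the target-class probability."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Standard focal loss: -(1 - p_t)^gamma * log p_t.

    gamma is an assumed illustrative value, not one taken from the paper.
    With gamma = 0 this reduces to plain cross entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]  # hypothetical scores for three speakers
    print(cross_entropy(logits, 0))
    print(focal_loss(logits, 0))
```

Because the modulating factor (1 - p_t)^gamma lies between 0 and 1, the focal loss never exceeds the cross entropy for the same example; the gap is largest when the target class is already predicted confidently.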
Abstract
Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNN) for extracting bottleneck features, key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on the performance of TD-SV. For training targets, we consider speaker identity, time-contrastive learning (TCL) and auto-regressive prediction coding with the first being supervised and the last two being self-supervised. Furthermore, we study a range of loss functions when speaker identity is used as the training target. With regard to activation functions, we study the widely used sigmoid function, rectified linear unit (ReLU), and Gaussian error linear unit (GELU). We experimentally show that GELU is able to reduce the error rates…
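The three activation functions compared in the abstract can be sketched as scalar functions. This is a minimal illustration using the standard definitions, with GELU written in its exact form x * Phi(x), where Phi is the standard normal CDF; whether the paper uses this exact form or a tanh approximation is not stated in this summary, so treat that choice as an assumption.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs, zeroes negative ones."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Gaussian error linear unit, exact form: x * Phi(x).

    Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
    """
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative inputs through with a small negative output, while behaving almost like the identity for large positive inputs.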
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Focal Loss
