On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification

Achintya kr. Sarkar; Zheng-Hua Tan

arXiv:2201.06426·cs.SD·January 19, 2022

TL;DR

This paper systematically evaluates the impact of training targets, activation functions, and loss functions on deep neural network performance in text-dependent speaker verification, highlighting the effectiveness of self-supervised methods and GELU activation.

Contribution

The paper analyzes training targets, loss functions, and activation functions for bottleneck-feature DNNs, and demonstrates the effectiveness of GELU and TCL in TD-SV systems.

Findings

01

GELU activation reduces error rates significantly.

02

Time-contrastive learning (TCL) outperforms other training targets.

03

Cross entropy, joint-softmax, and focal loss improve system accuracy.
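
For readers unfamiliar with focal loss, the following is a minimal sketch of how it relates to cross entropy for a single training example. It assumes the commonly used form FL = -(1 - p_t)^gamma * log(p_t), where p_t is the softmax probability of the true class; the function names and the default gamma = 2.0 are illustrative choices, and the exact configuration used in the paper is not given on this page.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the true class."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Cross entropy scaled by (1 - p_t) ** gamma.

    gamma is an assumed hyperparameter; gamma = 0 recovers cross entropy.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


if __name__ == "__main__":
    easy = [4.0, 0.5, 0.1]   # true class (index 0) already well separated
    hard = [0.6, 0.5, 0.4]   # true class barely ahead of the others
    for name, logits in (("easy", easy), ("hard", hard)):
        print(name, cross_entropy(logits, 0), focal_loss(logits, 0))
```

The scaling factor shrinks the loss of well-classified examples much more than that of hard ones, so training gradients concentrate on the examples the network still gets wrong.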

Abstract

Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNNs) for extracting bottleneck features, key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on the performance of TD-SV. For training targets, we consider speaker identity, time-contrastive learning (TCL), and auto-regressive prediction coding, with the first being supervised and the last two being self-supervised. Furthermore, we study a range of loss functions when speaker identity is used as the training target. With regard to activation functions, we study the widely used sigmoid function, rectified linear unit (ReLU), and Gaussian error linear unit (GELU). We experimentally show that GELU is able to reduce the error rates…
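
The three activation functions named in the abstract can be compared with a short sketch. It uses the standard scalar definitions, with GELU written in its exact form x * Phi(x), where Phi is the standard normal CDF; whether the paper uses this exact form or a tanh approximation is not stated here, so that choice is an assumption.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def gelu(x):
    """Gaussian error linear unit in exact form: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  "
              f"relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative inputs through with a small negative output instead of clipping them to zero.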

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Focal Loss