Additive Margin SincNet for Speaker Recognition

João Antônio Chagas Nunes; David Macêdo; Cleber Zanchettin

arXiv:1901.10826·eess.AS·October 15, 2019

TL;DR

This paper introduces AM-SincNet, a novel speaker recognition model combining SincNet with an improved AM-Softmax loss, significantly enhancing accuracy by enforcing class separation.

Contribution

The paper proposes integrating AM-Softmax with SincNet for speaker recognition, demonstrating a substantial performance improvement over traditional methods.

Findings

01. Approximately 40% reduction in Frame Error Rate on the TIMIT dataset
02. Effective class separation achieved by AM-Softmax in speaker recognition
03. Enhanced model performance over standard SincNet

Tables

Table I: SincNet and AM-SincNet Frame Error Rates (%) for the TIMIT dataset. The columns m=0.35 through m=0.80 are AM-SincNet runs with the given margin.

Epoch	SincNet	m=0.35	m=0.40	m=0.45	m=0.50	m=0.55	m=0.60	m=0.65	m=0.70	m=0.75	m=0.80
0	97.25	98.77	98.76	98.71	99.06	98.08	99.13	98.14	97.65	98.21	98.78
16	55.32	56.70	57.93	57.29	58.37	54.09	56.44	54.69	57.23	60.98	55.65
32	50.29	44.20	46.37	44.57	43.46	44.23	45.56	49.98	44.84	44.32	48.68
48	46.67	41.99	39.88	45.43	40.54	40.49	39.17	41.25	38.87	37.95	42.45
64	45.40	41.51	38.05	42.05	38.02	38.13	37.45	36.83	38.86	37.36	37.34
80	43.49	36.30	36.37	36.57	34.89	36.34	36.99	34.47	34.11	34.72	34.51
96	44.83	34.37	34.11	33.50	33.68	36.82	33.41	33.07	33.13	34.00	34.14
…	…	…	…	…	…	…	…	…	…	…	…
320	46.39	28.76	28.21	27.82	27.37	28.82	27.40	27.54	27.90	29.39	28.32
336	47.93	27.92	28.73	29.00	27.42	27.50	27.18	27.54	30.00	27.60	28.68
352	44.64	29.22	27.57	27.07	27.86	27.81	28.28	27.92	29.76	26.95	30.85

Equations

$$Loss = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{\phi_{i}}{\phi_{i}+\sum_{j=1,\,j\neq y_{i}}^{c}\exp\left(s\left(W_{j}^{T}f_{i}\right)\right)}$$

$$\phi_{i} = \exp\left(s\left(W_{y_{i}}^{T}f_{i}-m\right)\right)$$

Code & Models

Repositories

joaoantoniocn/AM-SincNet (PyTorch, official)

Full text

Additive Margin SincNet for Speaker Recognition

João Antônio Chagas Nunes, David Macêdo, Cleber Zanchettin

Centro de Informática, Universidade Federal de Pernambuco

50.740-560, Recife, PE, Brazil

Abstract

Speaker Recognition is a challenging task with essential applications such as authentication, automation, and security. SincNet is a new deep learning based model that has produced promising results on this task. When training deep learning systems, the loss function is essential to network performance. The Softmax loss function is widely used in deep learning methods, but it is not the best choice for all kinds of problems. For distance-based problems, a new Softmax-based loss function called Additive Margin Softmax (AM-Softmax) is proving to be a better choice than the traditional Softmax. The AM-Softmax introduces a margin of separation between the classes that forces the samples from the same class to be closer to each other and also maximizes the distance between classes. In this paper, we propose a new approach for speaker recognition systems called AM-SincNet, which is based on SincNet but uses an improved AM-Softmax layer. The proposed method is evaluated on the TIMIT dataset and obtains an improvement of approximately 40% in the Frame Error Rate compared to SincNet.

I Introduction

Speaker Recognition is an essential task with applications in biometric authentication, identification, and security, among others [1]. The field is divided into two main subtasks: Speaker Identification and Speaker Verification. In Speaker Identification, given an audio sample, the model tries to identify which speaker in a predetermined list the utterance belongs to. In Speaker Verification, the model verifies whether an audio sample belongs to a given speaker or not. Most techniques in the literature for this problem are based on $i$-vector methods [2], which extract features from the audio samples and classify them using methods such as PLDA [3], heavy-tailed PLDA [4], and Gaussian PLDA [5].

Despite the advances in recent years [6, 7, 8, 9, 10, 11, 12], Speaker Recognition is still a challenging problem. In the past years, Deep Neural Networks (DNNs) have gained ground in pattern recognition and signal processing tasks. Convolutional Neural Networks (CNNs) have already shown that they are currently the best choice for image classification, detection, and recognition tasks. In the same way, DNN models are being used in combination with traditional approaches or in end-to-end approaches for Speaker Recognition tasks [13, 14, 15]. In hybrid approaches, it is common to use the DNN model to extract features from a raw audio sample and then encode them as low-dimensional embedding vectors, in which samples sharing common features lie closer to each other. Usually, the embedding vectors are then classified using traditional approaches.

The difficulty behind Speaker Recognition tasks is that audio signals are hard to model with low-level and high-level features that are discriminative enough to distinguish different speakers. Methods that use handcrafted features can extract more human-readable features and are appealing because humans can see what the method is doing and which features are used to make the inference. Nevertheless, handcrafted features lack power. In fact, while we know what patterns they are looking for, we have no guarantee that these patterns are the best for the job. On the other hand, approaches based on Deep Learning can learn patterns that humans may not be able to understand, and they usually get better results than traditional methods, despite a higher computational cost for training.

A promising approach to Speaker Recognition based on Deep Learning is the SincNet model [17], which combines the power of Deep Learning with the interpretability of handcrafted features. SincNet uses a Deep Learning model to process raw audio samples and learn powerful features. To do so, it replaces the first layer of the DNN model, which is responsible for the convolution, with parametrized sinc functions. The parametrized sinc functions implement band-pass filters and are convolved with the audio waveform to extract basic low-level features that are later processed by the deeper layers of the network. The use of sinc functions helps the network learn more relevant features and also improves the convergence time of the model, as the sinc functions have significantly fewer parameters than the first layer of a traditional DNN. At the top of the model, SincNet uses a Softmax layer, which is responsible for mapping the final features processed by the network into a multi-dimensional space corresponding to the different classes or speakers.

The Softmax function is usually used as the last layer of DNN models. The function is used to delimit a linear surface that can be used as a decision boundary to separate samples from different classes. Although the Softmax function works well for optimizing a decision boundary that separates the classes, it is not designed to minimize the distance between samples of the same class. This characteristic may hurt model performance on tasks like Speaker Verification, which require measuring the distance between samples to make a decision. To deal with this problem, new approaches such as Additive Margin Softmax [16] (AM-Softmax) are being proposed. The AM-Softmax introduces an additive margin to the decision boundary, which maximizes the distance between the classes and at the same time minimizes the distance between samples of the same class.

In this paper, we propose a new method for Speaker Verification called Additive Margin SincNet (AM-SincNet), which is inspired by the SincNet architecture and the AM-Softmax loss function. To validate our hypothesis, the proposed method is evaluated on the TIMIT [18] dataset based on the Frame Error Rate. The remaining sections are organized as follows: Section II presents the related work, Section III introduces the proposed method, Section IV explains how we built our experiments, Section V discusses the results, and Section VI presents our conclusions.

II Related Work

For some time, $i$-vectors [2] have been used as the state-of-the-art feature extraction method for speaker recognition tasks. Usually, the extracted features are classified using PLDA [3] or similar techniques, such as heavy-tailed PLDA [4] and Gauss-PLDA [5]. The intuition behind these traditional methods and how they work is explained in more detail in [19]. Although they have given reasonable results, it is clear that there is still room for improvement [19].

Recently, neural networks and deep learning techniques have proven to be a particularly attractive choice for feature extraction and pattern recognition across a wide variety of data [20, 21]. For instance, CNNs are achieving high performance on image classification tasks. Moreover, deep learning architectures [22, 23] and hybrid systems [24, 25, 26, 27, 28] are producing higher-quality results on audio signal processing than traditional approaches. As an example, [29] built a speaker verification framework based on the Inception-Resnet-v1 deep neural network architecture using the triplet loss function.

SincNet [17] is one of these innovative deep learning architectures for speaker recognition. It uses parametrized sinc functions as the foundation of its first convolutional layer. Sinc functions are designed to process digital signals such as audio, and thus using them as the first convolutional layer helps the network capture more meaningful features. Additionally, the extracted features are also more human-readable than the ones obtained from ordinary convolutions.

Besides, the sinc functions reduce the number of parameters in the first layer of SincNet, because each sinc function of any size has only two parameters to learn, against $L$ for a conventional convolutional filter, where $L$ is the size of the filter. As a result, the sinc functions enable the network to converge faster. Another advantage of the sinc functions is that they are symmetric, which means that we can reduce the computational effort to process them by $50\%$ by simply calculating half of each filter and mirroring it to the other side.
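The two-parameter, symmetric filter described above can be sketched as follows. The closed form used here, $g[n] = 2f_{2}\,\mathrm{sinc}(2\pi f_{2}n) - 2f_{1}\,\mathrm{sinc}(2\pi f_{1}n)$ with $\mathrm{sinc}(x)=\sin(x)/x$, is the band-pass definition from the SincNet paper [17] and is not spelled out in this text. The function names, the example cutoff values, and the omission of the window function that SincNet applies to the filter are assumptions made for illustration only.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass(f1, f2, length=251):
    """Band-pass filter defined only by its two cutoffs f1 < f2.

    Cutoffs are normalized frequencies in cycles per sample (assumption).
    Only the center tap and the right half are computed; the left half is
    obtained by mirroring, which is the 50% saving mentioned in the text.
    """
    assert length % 2 == 1, "an odd length keeps the filter symmetric"
    half = length // 2

    def tap(n):
        return (2 * f2 * sinc(2 * math.pi * f2 * n)
                - 2 * f1 * sinc(2 * math.pi * f1 * n))

    right = [tap(n) for n in range(1, half + 1)]
    return right[::-1] + [tap(0)] + right


# Hypothetical cutoffs: 300 Hz and 3400 Hz at a 16 kHz sampling rate.
h = sinc_bandpass(300 / 16000, 3400 / 16000)

# Learnable parameters in a first layer of 80 filters of length 251.
sinc_params = 80 * 2      # two cutoffs per filter
conv_params = 80 * 251    # one weight per tap in a conventional filter
print(len(h), sinc_params, conv_params)
```

With these layer sizes, the sinc layer has 160 learnable parameters against 20,080 for a conventional convolutional layer of the same shape.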

The first layer of SincNet consists of 80 filters of length 251, followed by two conventional convolutional layers with 60 filters of length five each. Normalization is also applied to the input samples and to all convolutional layers, both the conventional ones and the sinc one. After that, the result propagates to three fully connected layers of size 2048 and is normalized again. The hidden layers use the Leaky ReLU [30] as the activation function. The sinc convolutional layer is initialized using mel-scale cutoff frequencies. On the other hand, the conventional convolutional layers and the fully connected layers are initialized using the Glorot scheme. Finally, a Softmax layer provides the set of posterior probabilities for the classification.

III Additive Margin SincNet

The AM-SincNet is built by replacing the softmax layer of the SincNet with the Additive Margin Softmax [16]. The Additive Margin Softmax (AM-Softmax) is a loss function derived from the original Softmax which introduces an additive margin to its decision boundary.

The additive margin works as a better class separator than the traditional decision boundary from Softmax. Furthermore, it also forces the samples from the same class to become closer to each other, thus improving results for tasks such as classification and verification. The AM-Softmax equation is written as:

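$$Loss = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{\phi_{i}}{\phi_{i}+\sum_{j=1,\,j\neq y_{i}}^{c}\exp\left(s\left(W_{j}^{T}f_{i}\right)\right)}$$

$$\phi_{i} = \exp\left(s\left(W_{y_{i}}^{T}f_{i}-m\right)\right)$$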

In the above equation, $W$ is the weight matrix, and $f_{i}$ is the input from the $i$-th sample to the last fully connected layer. The term $W_{y_{i}}^{T}f_{i}$ is also known as the target logit for the $i$-th sample. The parameters $s$ and $m$ are responsible for the scaling and the additive margin, respectively. Although the network can learn $s$ during the optimization process, this can make convergence very slow. Thus, a sensible choice is to follow [16] and set $s$ to a fixed value. On the other hand, the $m$ parameter is fundamental and has to be chosen carefully. In our context, we assume that both $W$ and $f$ are normalized to unit norm. Figure 1 shows a comparison between the traditional Softmax and the AM-Softmax.
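As a minimal sketch of the loss just defined, the following pure-Python function evaluates it for a small batch. It is not the authors' PyTorch implementation: the function names and the toy inputs are hypothetical, and numerical stability is handled with a log-sum-exp shift rather than the small constant the paper adds to the equation. Since $-\log\frac{\phi_{i}}{\phi_{i}+\sum_{j}\exp(\cdot)}$ equals the log of the denominator minus the target logit, the code computes it in that form.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def am_softmax_loss(features, weights, labels, s=30.0, m=0.5):
    """Mean AM-Softmax loss over a batch.

    features: n vectors f_i, weights: c class vectors W_j,
    labels: n target class indices y_i. Both W_j and f_i are normalized,
    so every W_j^T f_i is a cosine similarity in [-1, 1].
    """
    W = [l2_normalize(w) for w in weights]
    total = 0.0
    for f, y in zip(features, labels):
        f = l2_normalize(f)
        cos = [sum(a * b for a, b in zip(w, f)) for w in W]
        # The margin m is subtracted from the target logit only.
        logits = [s * (c - m) if j == y else s * c
                  for j, c in enumerate(cos)]
        # log(phi_i + sum_j exp(...)) with a max shift for stability.
        top = max(logits)
        log_denominator = top + math.log(
            sum(math.exp(z - top) for z in logits))
        total += log_denominator - logits[y]
    return total / len(features)


# Toy batch (hypothetical): two samples, two classes, perfectly aligned.
feats = [[1.0, 0.0], [0.0, 2.0]]
W = [[3.0, 0.0], [0.0, 1.0]]
print(am_softmax_loss(feats, W, [0, 1], m=0.0))
print(am_softmax_loss(feats, W, [0, 1], m=0.5))
```

Setting `m=0.0` reduces the function to a scaled Softmax loss on normalized inputs, so the second call returns a larger value than the first for the same correctly classified samples: the margin keeps penalizing samples until they clear the decision boundary by at least $m$.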

The SincNet approach has shown strong results on the speaker recognition task. Indeed, its architecture has been compared against ordinary CNNs and several other well-known methods for speaker recognition and verification, such as MFCC and FBANK, and in every scenario SincNet has outperformed the alternative approaches. The most significant contribution of SincNet was the use of sinc functions as its first convolutional layer. Nevertheless, to calculate the posterior probabilities over the target speakers, SincNet applies the Softmax loss function which, despite being a reasonable choice, is not particularly capable of producing a sharp distinction among the classes in the final layer. Thus, we have decided to replace the Softmax in the last layer of SincNet with AM-Softmax. Figure 2, a minor modification of the original SincNet image found in [17], shows the architecture of the proposed AM-SincNet.

IV Experiments

The proposed AM-SincNet method has been evaluated on the well-known TIMIT dataset [18], which contains audio samples from 630 different speakers of the eight main American dialects, where each speaker reads a few phonetically rich sentences. We used the same pre-processing procedures as [17]. For example, the non-speech intervals at the beginning and the end of the sentences were removed. Following the same protocol as [17], we used five utterances of each speaker for training the network and the remaining three for evaluation. Moreover, we also split the waveform of each audio sample into 200ms chunks with 10ms overlap, and these chunks were then used to feed the network.
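The chunking step can be illustrated with a short sketch that reads the description literally: 200ms windows, each sharing 10ms with the next one. The 16 kHz sampling rate is the TIMIT rate and is not stated in this text; the function name and the choice to drop a trailing partial chunk are assumptions, and the released code may sample chunks differently.

```python
def split_into_chunks(waveform, sample_rate=16000, chunk_ms=200, overlap_ms=10):
    """Split a 1-D sequence of samples into fixed-size overlapping chunks.

    A trailing remainder shorter than one chunk is dropped (assumption).
    """
    chunk = sample_rate * chunk_ms // 1000           # 3200 samples
    hop = chunk - sample_rate * overlap_ms // 1000   # 3040 samples
    return [waveform[start:start + chunk]
            for start in range(0, len(waveform) - chunk + 1, hop)]


# One second of placeholder audio; sample values are just their indices.
chunks = split_into_chunks(list(range(16000)))
print(len(chunks), len(chunks[0]))
```

Under these assumptions a one-second signal yields five chunks of 3200 samples each, and consecutive chunks share their boundary 160 samples.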

For training, we configured the network to use RMSprop as the optimizer with mini-batches of size 128, a learning rate of $lr\!=\!0.001$, $\alpha\!=\!0.95$, and $\epsilon\!=\!10^{-7}$. The AM-Softmax comes with two more parameters than the traditional Softmax: the scaling factor $s$ and the margin size $m$. As mentioned before, we set the scaling factor $s$ to a fixed value of 30 in order to speed up the network training. On the other hand, for the margin parameter $m$ we ran several experiments to evaluate its influence on the Frame Error Rate (FER).
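To make the role of the three optimizer settings concrete, here is a sketch of a single RMSprop update for one scalar parameter. It follows the common formulation in which $\alpha$ is the decay of the running average of squared gradients and $\epsilon$ is added outside the square root; libraries differ on where $\epsilon$ is placed, so this is an illustration under that assumption and not a copy of any particular framework's optimizer.

```python
import math


def rmsprop_step(param, grad, sq_avg, lr=0.001, alpha=0.95, eps=1e-7):
    """One RMSprop update for a scalar parameter.

    sq_avg is the running average of squared gradients carried between steps.
    Returns the updated parameter and the updated running average.
    """
    sq_avg = alpha * sq_avg + (1.0 - alpha) * grad * grad
    param = param - lr * grad / (math.sqrt(sq_avg) + eps)
    return param, sq_avg


# Hypothetical starting point: parameter 1.0, gradient 2.0, empty history.
p, v = rmsprop_step(1.0, 2.0, 0.0)
print(p, v)
```

Because the gradient is divided by the root of its own running average, the step size stays on the order of the learning rate regardless of the raw gradient scale.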

We also added a constant $\epsilon$ of value $10^{-11}$ to the AM-Softmax equation in order to avoid division by zero where required. For each experiment, we trained the models for exactly 352 epochs, as this appeared to be enough to adequately expose the different training speeds of the two competing models. To run the experiments, we used an NVIDIA Titan XP GPU, and the training process lasted about four days. The experiments in this paper can be reproduced using the code that we made available online on GitHub (https://github.com/joaoantoniocn/AM-SincNet).

V Results

Several experiments were run to evaluate the proposed method against the traditional SincNet approach. In every one of them, the proposed AM-SincNet produced more accurate results. The proposed AM-SincNet method requires two more parameters, the scaling parameter $s$ and the margin parameter $m$. We decided to use $s\!=\!30$, and we ran experiments to evaluate the influence of the margin parameter $m$ on the Frame Error Rate.

Table I shows the Frame Error Rate (FER) in percentage for the original SincNet and our proposed method over 352 epochs on the test data. To verify the influence of the margin parameter on the proposed method, we performed several experiments using different values of $m$ in the range $0.35\!\leq\!m\!\leq\!0.80$. The table shows the results from the first 96 and the last 32 epochs in steps of 16. The best result from each epoch is highlighted in bold.

The traditional SincNet only gets better results than the proposed AM-SincNet in the first epochs, when neither model has had enough training time yet. After that, at epoch 48, the original SincNet starts to converge to an FER of around $46\%$, while the proposed method keeps decreasing its error throughout training.

At epoch 96, the proposed method already has an FER more than $26\%$ better than the original SincNet for almost every value of $m$, excluding $m\!=\!0.55$. The difference keeps increasing over the epochs, and at epoch 352 the proposed method has an FER of $26.95\%$ ($m\!=\!0.75$) against $44.64\%$ from SincNet, which means that at this epoch AM-SincNet has a Frame Error Rate approximately $40\%$ better than the traditional SincNet. Figure 3 plots the Frame Error Rate on the test data for both methods along the training epochs. For the AM-SincNet, we used the margin parameter $m\!=\!0.50$.
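The "approximately 40%" figure is a relative reduction, not a difference in percentage points. A short check using the two epoch-352 values quoted above makes the distinction explicit; the function name is chosen here for illustration.

```python
def relative_reduction(baseline, proposed):
    """Fraction by which `proposed` is lower than `baseline`."""
    return (baseline - proposed) / baseline


sincnet_fer = 44.64      # SincNet FER (%) at epoch 352, from Table I
am_sincnet_fer = 26.95   # AM-SincNet FER (%) at epoch 352, m = 0.75

reduction = relative_reduction(sincnet_fer, am_sincnet_fer)
print(f"absolute: {sincnet_fer - am_sincnet_fer:.2f} points, "
      f"relative: {100 * reduction:.1f}%")
```

The absolute gap is about 17.7 percentage points, which corresponds to a relative reduction of roughly 39.6% of the baseline error.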

From Table I, we can also see the impact of the margin parameter $m$ on our proposed method. The FER for $m\!=\!0.50$ reached the lowest (best) value at epochs 32 and 320. In the same way, $m\!=\!0.55$ and $m\!=\!0.60$ reached the lowest values at epochs 16 and 336, respectively. The value $m\!=\!0.65$ scores the lowest result at epochs 64 and 96, while $m\!=\!0.70$ reached the lowest score at epoch 80, and $m\!=\!0.75$ reached the lowest value at epochs 48 and 352.
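The per-epoch winners listed above can be reproduced directly from the AM-SincNet columns of Table I. The sketch below copies those rows (epoch 0 and the elided epochs are left out) and picks the margin with the lowest FER in each; the variable and function names are illustrative.

```python
MARGINS = [0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80]

# AM-SincNet FER (%) per epoch, copied from Table I, one value per margin.
AM_FER = {
    16:  [56.70, 57.93, 57.29, 58.37, 54.09, 56.44, 54.69, 57.23, 60.98, 55.65],
    32:  [44.20, 46.37, 44.57, 43.46, 44.23, 45.56, 49.98, 44.84, 44.32, 48.68],
    48:  [41.99, 39.88, 45.43, 40.54, 40.49, 39.17, 41.25, 38.87, 37.95, 42.45],
    64:  [41.51, 38.05, 42.05, 38.02, 38.13, 37.45, 36.83, 38.86, 37.36, 37.34],
    80:  [36.30, 36.37, 36.57, 34.89, 36.34, 36.99, 34.47, 34.11, 34.72, 34.51],
    96:  [34.37, 34.11, 33.50, 33.68, 36.82, 33.41, 33.07, 33.13, 34.00, 34.14],
    320: [28.76, 28.21, 27.82, 27.37, 28.82, 27.40, 27.54, 27.90, 29.39, 28.32],
    336: [27.92, 28.73, 29.00, 27.42, 27.50, 27.18, 27.54, 30.00, 27.60, 28.68],
    352: [29.22, 27.57, 27.07, 27.86, 27.81, 28.28, 27.92, 29.76, 26.95, 30.85],
}


def best_margin(row):
    """Return (margin, FER) for the lowest FER in one table row."""
    fer = min(row)
    return MARGINS[row.index(fer)], fer


best = {epoch: best_margin(row) for epoch, row in AM_FER.items()}
for epoch, (m, fer) in best.items():
    print(f"epoch {epoch:>3}: m={m:.2f} FER={fer:.2f}%")
```

No single margin wins consistently across epochs, and at the last three listed epochs the best and worst margins are only a few points apart.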

The values $m\!=\!0.35$, $m\!=\!0.40$, $m\!=\!0.45$, and $m\!=\!0.80$ do not reach the lowest value at any epoch in this table. Although the results in Table I may suggest that there is a golden value of $m$ that yields the best Frame Error Rate for the experiments, the differences in FER across values of $m$ may not be that significant. Indeed, at the end of training, all of the AM-SincNet experiments seem to approach an FER of around $27\%$. In any case, AM-SincNet outperforms the baseline approach.

VI Conclusion

This paper has proposed a new approach for directly processing waveform audio that is inspired by the SincNet neural network architecture and the Additive Margin Softmax loss function. The proposed method, AM-SincNet, has shown a Frame Error Rate about 40% lower than the traditional SincNet. This shows that the loss function used in a model can have a significant impact on the results.

From Figure 3, it is possible to notice that the FER ($\%$) of the proposed method may not have converged yet in the last epochs. Thus, if the training had lasted longer, we might have observed an even more significant difference between the two methods. The proposed method comes with two more parameters to set than the traditional SincNet, although the experiments made here show that these extra parameters can be fixed values without compromising the performance of the model.

For future work, we would like to test our method on different datasets such as VoxCeleb2 [22], which has over a million samples from over 6k speakers. With more data, the model may show an even more significant result. We also intend to use more metrics, such as the Classification Error Rate (CER, $\%$) and the Equal Error Rate (EER, $\%$), to compare the models.

Acknowledgment

This work was supported in part by CNPq and CETENE (Brazilian research agencies). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan XP GPU used for this research.

Bibliography

[1] H. Beigi, Fundamentals of Speaker Recognition. Springer Publishing Company, Incorporated, 2011.
[2] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, 2010.
[3] S. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8, 2007.
[4] P. Matejka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, and J. Cernocký, “Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification,” 05 2011, pp. 4828–4831.
[5] S. Cumani, O. Plchot, and P. Laface, “Probabilistic linear discriminant analysis of i-vector posterior distributions,” in ICASSP. IEEE, 2013, pp. 7644–7648. [Online]. Available: http://dblp.uni-trier.de/db/conf/icassp/icassp2013.html#CumaniPL13
[6] W. Campbell, D. Sturim, D. Reynolds, and A. Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” vol. 1, 06 2006, pp. I – I.
[7] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1435–1447, 06 2007.
[8] S. Cumani, O. Plchot, and P. Laface, “Probabilistic linear discriminant analysis of i-vector posterior distributions,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 7644–7648.