On Residual CNN in text-dependent speaker verification task

Egor Malykh; Sergey Novoselov; Oleg Kudashev

arXiv:1705.10134·cs.SD·May 31, 2017

On Residual CNN in text-dependent speaker verification task

Egor Malykh, Sergey Novoselov, Oleg Kudashev

PDF

Open Access

TL;DR

This paper explores the use of deep residual CNNs with spectrogram inputs for text-dependent speaker verification, achieving promising results and significant improvements through system fusion.

Contribution

It introduces a residual CNN approach to speaker verification and evaluates its performance, demonstrating potential despite not surpassing the baseline alone.

Findings

01

Achieved 5.23% ERR on RSR2015 evaluation.

02

Fusion of systems improved performance by 18%.

03

Residual CNNs show promise for speaker verification.

Abstract

Deep learning approaches are still not very common in the speaker verification field. We investigate the possibility of using deep residual convolutional neural network with spectrograms as an input features in the text-dependent speaker verification task. Despite the fact that we were not able to surpass the baseline system in quality, we achieved a quite good results for such a new approach getting an 5.23% ERR on the RSR2015 evaluation part. Fusion of the baseline and proposed systems outperformed the best individual system by 18% relatively.

Tables2

Table 1. Table 1: Residual CNN architecture

layer	kernel/stride	output	#parameters
Input	$-$	$257 \times 800 \times 1$	$0$
Conv+BN+ReLU	$7 \times 7$ / $2 \times 2$	$129 \times 400 \times 64$	$3.2$ K
Maximum pooling	$3 \times 3$ / $2 \times 2$	$65 \times 200 \times 64$	$0$
Residual block	$3 \times 3$ / $1 \times 1$	$65 \times 200 \times 64$	$74.1$ K
	$3 \times 3$ / $1 \times 1$
Residual block	$3 \times 3$ / $1 \times 1$	$65 \times 200 \times 64$	$74.1$ K
	$3 \times 3$ / $1 \times 1$
Residual block	$3 \times 3$ / $2 \times 2$	$33 \times 100 \times 128$	$230.1$ K
	$3 \times 3$ / $1 \times 1$
Residual block	$3 \times 3$ / $1 \times 1$	$33 \times 100 \times 128$	$296.2$ K
	$3 \times 3$ / $1 \times 1$
Residual block	$3 \times 3$ / $2 \times 2$	$17 \times 50 \times 256$	$919.8$ K
	$3 \times 3$ / $1 \times 1$
Residual block	$3 \times 3$ / $1 \times 1$	$17 \times 50 \times 256$	$1 182.2$ K
	$3 \times 3$ / $1 \times 1$
Residual block	$3 \times 3$ / $2 \times 2$	$9 \times 25 \times 512$	$3 674.7$ K
	$3 \times 3$ / $1 \times 1$
Residual block	$3 \times 3$ / $1 \times 1$	$9 \times 25 \times 512$	$4 723.7$ K
	$3 \times 3$ / $1 \times 1$
Average pooling	$-$	$512$	$0$
SoftMax	$-$	$97$	$50$ K
Total			$11 228.0$ K

Table 2. Table 2: Evaluation results in terms of EER [%] and minDCF

System	EER	minDCF
Baseline	$0.79$	$0.23$
Deep CNN	$6.02$	$0.94$
Deep CNN (ext)	$5.23$	$0.92$
Fusion	0.64	0.18

Equations6

s = μ + T w,

s = μ + T w,

x_{l + 1} = x_{l} + F (x_{l}, W_{l}),

x_{l + 1} = x_{l} + F (x_{l}, W_{l}),

\overline{W} = W + \frac{1}{2} E,

\overline{W} = W + \frac{1}{2} E,

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Full text

11institutetext: ITMO University, St.Petersburg, Russia

22institutetext: STC-innovations Ltd., St.Petersburg, Russia

22email: {malykh, novoselov, kudashev}@speechpro.com

On Residual CNN in text-dependent speaker verification task

Egor Malykh 11

Sergey Novoselov 11 2 2

Oleg Kudashev 11 2 2

Abstract

Deep learning approaches are still not very common in the speaker verification field. We investigate the possibility of using deep residual convolutional neural network with spectrograms as an input features in the text-dependent speaker verification task. Despite the fact that we were not able to surpass the baseline system in quality, we achieved a quite good results for such a new approach getting an 5.23% ERR on the RSR2015 evaluation part. Fusion of the baseline and proposed systems outperformed the best individual system by 18% relatively.

Keywords:

speaker verification, residual learning, CNN, FFT

1 Introduction

I-vector systems are well-known for being state-of-the-art solutions to the text-independent speaker verification task [1, 2, 3, 21]. Recently, the solution of this task has increasingly been considered from the perspective of deep learning approaches. For instance, ASR deep neural network (DNN) model [3, 22] divides the acoustic space into senone classes and discriminates the speakers in this space using the classic total variability (TV) model [1]. In such phonetic discriminative DNN based systems two main approaches can be distinguished. The first is to use DNN posteriors to calculate Baum-Welch statistics, and the second is to use the bottleneck features in combination with speaker specific features (MFCC) for training the full TV-UBM system. The second approach is considered the most robust to varying conditions [4].

As demonstrated by recent publications [6, 8, 9, 10, 23], substantial success of the state-of-the-art text-dependent verification systems is mainly due to the progress in text-independent speaker recognition task. Thus, the success of the phonetic discriminative DNN in such a task leads to attempts to use similar approach in text-dependent systems [11, 16, 5].

In parallel, there are several studies on the use of Deep-Learning approaches aiming to create an end-to-end solutions for discriminating speakers directly in a text-dependent task [14, 13]. Such approaches are easily applicable when the duration of the considered utterances is small, since they can be fed as an input of a deep architecture entirely, for example as a spectrogram.

A speaker discriminative approach is the most natural way for speaker verification. [12] describes a DNN for extracting a small speaker footprint which can be used to discriminate between speakers.

In this paper we investigate the deep residual CNN [15] for direct speaker discrimination. Unlike [14] we focus on the use of spectrograms instead of MFCC as the input features and deep but light residual architecture instead of VGG-like network as the mapping.

2 Baseline

A standard i-vector system is used as the baseline in our experiments. The i-vector system models a speech utterance as a low dimensional vector of channel- and speaker-dependent factors using total variability approach, as follows:

[TABLE]

where $s$ is the mean supervector, $\mu$ is the mean supervector of an Universal Background Model (UBM), $T$ is a low rank matrix and $w$ is the i-vector estimated using the Factor Analysis method [1].

We used implementation of the back-end from [16]. All i-vectors are length normalized and further regularized using the phrase-dependent Within-class Covariance Normalization (WCCN). A simple cosine distance scoring is used followed by phrase-dependent s-norm score normalization [10].

19 Mel-Frequency Cepstral Coefficients (MFCC) + log energy is used as the baseline features. They are normalized by mean and variance and augmented with $\Delta$ and $\Delta\Delta$ . For this system we did not apply voice activity detection.

3 CNN

3.1 Features

We use the normalized log power magnitude spectrum obtained via Fast Fourier Transform (FFT) as the input acoustic features for this system. Spectrograms are extracted with the following parameters: window size is $256$ , step size is $64$ and Blackman window function is used. Example of such spectrogram is shown in Figure 1.

The length of the spectrogram along the frequency axis is fixed, but the length along the time axis varies depending on the utterance. However, CNN requires a constant-size image as the input. In order to satisfy this requirement we use the following technique. Images longer than 800 pixels wide are cropped. Images shorter than 800 pixels wide are complimented to the right by their own copy. Such cropping and padding technique is illustrated in Figure 2.

3.2 Residual architecture

Spectrograms, being two-dimensional tensors, can be considered as images and can be processed by methods used for image processing. Currently, the best convolutional architecture for solving image processing tasks is a Residual CNN [15]. Residual architecture is described in [15, 20] as a stack of several residual units. Residual unit is a mapping

[TABLE]

where $x_{l}$ and $x_{l+1}$ are the unit’s input and output. $\mathcal{F}$ consists of two $3\times 3$ convolutions with weights $\mathcal{W}_{l}$ . Additive "shortcut connection" allows the network to satisfy the basic property: adding more layers does not lead to a degradation of the network. Thus, it becomes possible to train very deep networks with a size of 152 or more layers, as shown in the [15]. For this study, a network with 18 layers from [15] with modifications from [20] was used. Network architecture is shown in table 1. The structure of basic residual block is presented in figure 3.

4 Experimental setup

4.1 RSR2015 corpus

In our experiments we use the RSR2015 database [7]. The RSR2015 provides data for three main use-case verification scenarios:

•

unique pass-phrase: each client pronounces the same pass-phrase,

•

user-dependent pass-phrase: each client pronounces his or her own pass-phrase,

•

prompted text: each client pronounces a sentence prompted by the system.

In this paper, our focus is on the first use-case where each speaker pronounces a particular sentence. The RSR2015 database contains audio recordings from 300 speakers (143 female and 157 male). There are 9 sessions for each of the participants. Each session consists of 30 short sentences. The database is collected in the office environment using six different portable recording devices (four smartphones and two tablets). Each speaker was recorded using three random different devices out of the six.

The database is randomly split into three non-overlapping groups of speakers, one for background training, one for development stage and one for evaluation stage. The number of male/female speakers is balanced for each group: 50/47 in the background set, 50/47 in the development set and 57/49 in the evaluation set.

We use the background set only for training our speaker verification systems. The development set is used to estimate calibration and fusion parameters. All test trials are performed on the evaluation set.

We focuse only on the scenario where the speaker pronounces correct pass-phrase. All experiments are conducted according to the part 1 protocols of the RSR2015 database. We consider pooled male and female trials for system performance measure.

Extended training set which contains the background and development sets is used in additional experiment.

4.2 Baseline

Parameters of WCCN matrix and i-vector extractor are estimated using the background subset of the RSR2015 corpus only. As described in [16], we use the following representation of the WCCN matrix:

[TABLE]

where $E$ is the unit matrix of appropriate dimensionality. This trick helps to prevent an overfitting despite the small number of speakers in the background subset.

4.3 CNN

CNN is implemented using the Keras framework [17] on top of the TensorFlow [18] backend. ADAM optimizer [19] with learning rate set at $10^{-4}$ is used for training

Network is trained to discriminate between all speakers in training set using the softmax layer and categorical cross-entropy loss function. In the evaluation phase an output from the 512-dimensional (same as i-vector) penultimate layer is used as the embedding corresponding to the input utterance.

5 Results and discussion

The result of our research is presented in Table 2 in terms of the Equal Error Rate (EER) and the minimum detection cost function (minDCF) with $P_{\textrm{tar}}=10^{-3}$ . Baseline system demonstrated a very good result with an EER of less than 1% which is comparable with the result from [16]. Deep CNN system achieved an EER of 6.02%. Fusion of this two systems shows 18% relative improvement over the baseline system which is the evidence of the fact that classic i-vector systems and deep learning systems results in decorrelated embeddings and thus can be used together.

Relatively poor performance of the system under investigation can be explained by the small size of the training set (97 speakers). Such conditions leads to overfitting of discriminative model. The hypothesis is that the deep residual CNN requires much more data for training and expanding training set will lead to a significant increase in accuracy. Experiments on the extended training set (194 speakers) sustains it resulting in an 5.23% EER. We hope that deep learning approaches will be able to outperform the i-vector based systems in the future.

Figure 5 illustrates the projection of CNN embeddings of the 9 randomly chosen speakers on two principal axis using the Principal Component Analysis. DET-curves of the all considered methods are shown in Figure 4.

6 Conclusion

In this paper, we presented studies of deep residual CNN architecture in the task of text-dependent verification. Raw normalized spectrograms of speech signals is used as the input features. Experiments conducted on Part 1 of the RSR2015 database showed that despite the small amount of training data, it is possible to train a deep speaker embeddings extractor, which makes it possible to separate the speaker classes fairly well. Best achieved result of the individual system is an 5.23% EER.

We also showed that increasing the amount of training data leads to the expected strengthening of the extractor and improvement of the results. Our future work will be focused on the improving the quality of deep CNN based systems and bringing them to the level of baseline i-vector systems. It can be noted already that fusion of the deep CNN and i-vector extractors gives a good performance gain of 18% relative improvement.

7 Acknowledgements

This work was financially supported by the Ministry of Education and Science of the Russian Federation, contract 14.578.21.0126 (ID RFMEFI57815X0126).

Bibliography23

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2011). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788-798.
2[2] Kenny, P., Ouellet, P., Dehak, N., Gupta, V., & Dumouchel, P. (2008). A study of interspeaker variability in speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 16(5), 980-988.
3[3] Lei, Y., Scheffer, N., Ferrer, L., & Mc Laren, M. (2014, May). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 1695-1699). IEEE.
4[4] Mc Laren, M., Lei, Y., & Ferrer, L. (2015, April). Advances in deep neural network approaches to speaker recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on (pp. 4814-4818). IEEE.
5[5] Bhattacharya, G., Alam, J., Stafylakis, T., & Kenny, P. Deep Neural Network based Text-Dependent Speaker Recognition: Preliminary Results.
6[6] Stafylakis, T., Kenny, P., Ouellet, P., Perez, J., Kockmann, M., & Dumouchel, P. (2013). Text-dependent speaker recognition using PLDA with uncertainty propagation. matrix, 500, 1.
7[7] Larcher, A., Lee, K. A., Ma, B., & Li, H. (2012, September). RSR 2015: Database for Text-Dependent Speaker Verification using Multiple Pass-Phrases. In INTERSPEECH (pp. 1580-1583).
8[8] Larcher, A., Lee, K. A., Ma, B., & Li, H. (2014). Text-dependent speaker verification: Classifiers, databases and RSR 2015. Speech Communication, 60, 56-77.