Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification
Youngmoon Jung, Younggwan Kim, Hyungjun Lim, Yeunju Choi, Hoirin Kim

TL;DR
This paper introduces a novel spatial pyramid encoding pooling method combined with deep length normalization to improve speaker embeddings for text-independent speaker verification, demonstrating superior performance on VoxCeleb1.
Contribution
The paper proposes a new pooling technique called spatial pyramid encoding with convex length normalization, enhancing speaker verification accuracy.
Findings
Outperforms i-vector and d-vector baselines on VoxCeleb1
Generates fixed-dimensional embeddings from variable-length speech
Effectively normalizes embeddings using ring loss
Table 1: Architecture of the ResNet-34 frame-level feature extractor.

| stage | output size | ResNet-34 |
|---|---|---|
| conv1 | , stride | |
| conv2 | | |
| conv3 | | |
| conv4 | | |
| conv5 | | |

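Table 2: Comparison of pooling methods (softmax loss with ring loss-based deep length normalization).

| Pooling | EER (%) | DCF$_{0.01}$ | DCF$_{0.001}$ |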
|---|---|---|---|
| TAP | 4.62 | 0.460 | 0.581 |
| LDE | 4.33 | 0.435 | 0.549 |
| 2D-SPP | 4.59 | 0.452 | 0.573 |
| 1D-SPP | 4.50 | 0.447 | 0.564 |
| 2D-SPE | 4.29 | 0.428 | 0.534 |
| 1D-SPE | 4.20 | 0.422 | 0.528 |

Table 3: Comparison of deep length normalization methods. (F): fixed value, (L): learned value.

| Loss | Target norm $R$ | EER (%) | DCF$_{0.01}$ | DCF$_{0.001}$ |
|---|---|---|---|---|
| SM | - | 6.87 | 0.538 | 0.708 |
| $L_2$-Cons SM | 12 (F) | 4.83 | 0.479 | 0.572 |
| $L_2$-Cons SM | 24.1 (L) | 5.13 | 0.498 | 0.601 |
| SM + Ring | 20.5 (L) | 4.62 | 0.460 | 0.581 |
| ASM | - | 4.88 | 0.499 | 0.597 |
| $L_2$-Cons ASM | 30 (F) | 4.69 | 0.478 | 0.584 |
| $L_2$-Cons ASM | 28.3 (L) | 4.73 | 0.475 | 0.594 |
| ASM + Ring | 24.8 (L) | 4.41 | 0.451 | 0.559 |

Table 4: Comparison with recently reported SV systems on VoxCeleb1.

| Systems | Loss Norm | Pooling | Scoring | EER (%) |
|---|---|---|---|---|
| i-vector [35] | - | - | PLDA | 5.4 |
| VGG-M [14] | Contrastive | TAP | Cosine | 7.8 |
| VGG (1D) [35] | SM | SP | PLDA | 5.3 |
| VGG-13 [36] | Center | TAP | Cosine | 4.9 |
| ResNet-34 [18] | ASM | TAP | PLDA | 4.46 |
| ResNet-34 [18] | ASM | SAP | PLDA | 4.40 |
| ResNet-34 [18] | ASM | LDE | PLDA | 4.48 |
| ResNet-34 [20] | $L_2$-Cons SM | TAP | PLDA | 4.74 |
| Proposed | ASM + R | SPE | Cosine | 4.03 |
Abstract
In this paper, we propose a new pooling method called spatial pyramid encoding (SPE) to generate speaker embeddings for text-independent speaker verification. We first partition the output feature maps from a deep residual network (ResNet) into increasingly fine sub-regions and extract speaker embeddings from each sub-region through a learnable dictionary encoding layer. These embeddings are concatenated to obtain the final speaker representation. The SPE layer not only generates a fixed-dimensional speaker embedding for a variable-length speech segment, but also aggregates the information of feature distribution from multi-level temporal bins. Furthermore, we apply deep length normalization by augmenting the loss function with ring loss. By applying ring loss, the network gradually learns to normalize the speaker embeddings using model weights themselves while preserving convexity, leading to more robust speaker embeddings. Experiments on the VoxCeleb1 dataset show that the proposed system using the SPE layer and ring loss-based deep length normalization outperforms both i-vector and d-vector baselines.
Index Terms: speaker verification, spatial pyramid encoding, learnable dictionary encoding, ring loss, length normalization
1 Introduction
Speaker verification (SV) is the task of verifying a person’s claimed identity based on his or her voice. Depending on the lexicon constraint on the spoken content, the SV systems can be classified into two categories, text-dependent speaker verification (TD-SV) and text-independent speaker verification (TI-SV). TD-SV requires the content of input speech to be fixed, while TI-SV operates on unconstrained speech.
The combination of i-vector [1] and probabilistic linear discriminant analysis (PLDA) [2] has been the dominant approach for TI-SV tasks [3, 4]. Recently, a deep neural network (DNN) trained for automatic speech recognition (ASR) was integrated into the i-vector system, which improved the conventional Gaussian Mixture Model-Universal Background Model (GMM-UBM) based i-vector system [5, 6]. However, the use of the additional ASR-DNN drastically increases the computational complexity and also requires transcribed data for training.
Another deep learning-based approach is to extract speaker embeddings directly from a speaker discriminative network [7, 8, 9, 10, 11]. In such systems, the network is trained to classify speakers in the training set, or to separate same-speaker and different-speaker utterance pairs. After training, the utterance-level speaker embeddings (called d-vectors) are obtained by aggregating the frame-level features extracted from the network.
Most d-vector based SV systems use a pooling mechanism to map a variable-length segment to a fixed-dimensional embedding vector. Average pooling is the most common method to extract the utterance-level speaker representations [12, 13, 14]. Recently, more advanced pooling methods have been proposed. Snyder et al. [15] introduced the statistics pooling layer, in which the standard deviation is used as well as the mean. Okabe et al. [16] combined the attention mechanism with the statistics pooling layer to propose the attentive statistics pooling layer. Zhang et al. [9] proposed replacing the average pooling layer with the spatial pyramid pooling (SPP) layer [17] to maintain spatial information by pooling in local spatial bins. Cai et al. [18] applied the learnable dictionary encoding (LDE) scheme for extracting speaker embeddings, imitating the process of encoding GMM supervectors within a deep learning framework. These approaches improved the performance over simple average pooling.
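To make the difference between average pooling and statistics pooling concrete, the following is a minimal sketch in plain Python. The function names and the list-of-lists feature layout are illustrative assumptions, not taken from any of the cited systems.

```python
import math

def average_pooling(frames):
    """Mean over time of frame-level features (a list of equal-length lists)."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]

def statistics_pooling(frames):
    """Concatenate the per-dimension mean and standard deviation over time."""
    mean = average_pooling(frames)
    n = len(frames)
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
           for d in range(len(mean))]
    return mean + std

short_utt = [[1.0, 2.0], [3.0, 6.0]]   # 2 frames, 2 feature dimensions
long_utt = short_utt * 50              # 100 frames, same feature dimensions

# Both utterances map to vectors of the same size, whatever their duration.
print(len(average_pooling(short_utt)), len(average_pooling(long_utt)))
print(len(statistics_pooling(short_utt)), len(statistics_pooling(long_utt)))
```

Both functions return a vector whose size depends only on the feature dimension, which is the property every pooling layer discussed here must satisfy.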
Once i-vectors or d-vectors are extracted, length normalization is usually applied so that the speaker representations have unit norm [19, 13]. In [20], the authors introduced $L_2$-constraint based deep length normalization. They added an $L_2$-normalization layer followed by a scale layer to constrain the representations to lie on a hypersphere of a fixed radius. They showed that integrating this simple step in the training pipeline boosts the performance of speaker verification.
In this work, we propose a new pooling scheme, called spatial pyramid encoding (SPE). After the frame-level features are extracted from ResNet [21], we divide the feature maps of the last layer into uniform grids at different scales. Unlike the SPP layer, which applies average pooling to each sub-region, we extract embeddings from each sub-region through the LDE layer. The final speaker representation is produced by aggregating the embeddings from each sub-region. Furthermore, we apply convex length normalization using ring loss [22] to normalize the speaker embedding. We show that ring loss-based deep length normalization performs better than the $L_2$-constraint based one.
In this paper, we first describe the d-vector systems in Section 2. Section 3 reviews the related prior works. Section 4 presents our proposed methods. The experimental setup and results are described in Section 5 and Section 6, respectively. We conclude this work in Section 7.
2 d-vector systems
We can classify d-vector based SV systems according to the loss function used. The first one is based on the softmax loss defined in [23] as the combination of a cross-entropy loss, a softmax function and the last fully connected layer [7, 8, 24]. In this system, a speaker classifier is trained to classify speakers in the training set. The softmax loss encourages the separability of speaker embeddings. However, the softmax loss is not sufficient to learn the discriminative embedding with a large margin, and more researchers began to explore discriminative loss functions for enhanced generalization ability.
Another type of system is based on the triplet loss [9] which enhances the intra-class compactness and inter-class separability, leading to better generalization ability. It minimizes the distance between embedding pairs from the same speaker and maximizes the distance between pairs from different speakers. A drawback is that it requires the careful selection of triplets of samples, which is time-consuming and performance-sensitive.
To circumvent the triplet-wise computation and learn more discriminative representations, the center loss [25] and angular softmax (A-softmax) loss [26] are applied to SV tasks, respectively [10, 11]. The center loss minimizes the Euclidean distance between the embeddings and the corresponding class centroids. The angular softmax loss introduces an angular margin into the softmax loss by designing a differentiable angular distance function. The hyperparameter $m$ controls the size of the angular margin. A larger $m$ imposes a more stringent constraint on the distribution of the deep embeddings and enforces a larger angular margin between classes.
For all the systems mentioned above, the frame-level features are extracted from the speaker discriminative network. Then, the d-vector is obtained by a pooling layer that aggregates the frame-level features across time. The speaker-dependent d-vector for each enrollment speaker is stored after the d-vector is divided by its $L_2$-norm for length normalization. Finally, scoring between enrollment and test d-vectors is performed using either the cosine distance or PLDA.
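As an illustration of how the angular margin acts, the sketch below evaluates the angular function that A-softmax substitutes for $\cos\theta$ on the target class, assuming the standard definition from [26]: $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$ for $\theta \in [k\pi/m, (k+1)\pi/m]$. The function name `psi` is chosen here for illustration.

```python
import math

def psi(theta, m=4):
    """A-softmax angular function: (-1)^k * cos(m * theta) - 2k,
    where k indexes the interval [k*pi/m, (k+1)*pi/m] containing theta."""
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

# psi never exceeds cos(theta), so the target class must reach a smaller
# angle to obtain the same logit; a larger m makes the gap wider.
for theta in (0.0, 0.3, 0.6, 1.2):
    print(round(theta, 1), round(math.cos(theta), 3), round(psi(theta, m=4), 3))
```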
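The length normalization and cosine scoring steps can be sketched as follows. This is a minimal plain-Python illustration with hypothetical function names; it does not cover PLDA scoring.

```python
import math

def l2_normalize(vec):
    """Divide a d-vector by its L2 norm so that it has unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine_score(enroll, test):
    """Cosine similarity; on unit-norm vectors it reduces to a dot product."""
    e, t = l2_normalize(enroll), l2_normalize(test)
    return sum(a * b for a, b in zip(e, t))

# The score depends only on direction, not on the embedding length.
print(cosine_score([1.0, 0.0], [2.0, 0.0]))
print(cosine_score([1.0, 0.0], [0.0, 3.0]))
```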
3 Prior works
3.1 Learnable dictionary encoding layer
Cai et al. [18] employed the learnable dictionary encoding (LDE) layer [27] for speaker recognition. The LDE layer acts as a pooling layer integrated on top of convolutional layers, which ports the entire dictionary learning and encoding pipeline into a single model. It accepts variable-length inputs and produces fixed-length speaker embeddings. We assume that frame-level features are distributed in codewords and the LDE layer learns a dictionary, a set of codewords. This is essentially the same as the conventional GMM supervector.
The LDE layer considers an input feature map with the shape of $C \times H \times W$ as a set of $C$-dimensional input features $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, where $N$ is the total number of features given by $H \times W$. It learns an inherent codebook $D = \{\mathbf{d}_1, \ldots, \mathbf{d}_K\}$ containing $K$ codewords and a set of smoothing factors $S = \{s_1, \ldots, s_K\}$, one per codeword. The residual encoding $\mathbf{e}_k$ for codeword $\mathbf{d}_k$ is generated by aggregating the residuals with soft-assignment weights:
$$\mathbf{e}_k = \sum_{i=1}^{N} w_{ik}\,\mathbf{r}_{ik},$$
where the residuals are given by $\mathbf{r}_{ik} = \mathbf{x}_i - \mathbf{d}_k$. The assigning weight $w_{ik}$ is given by a softmax function as follows:
$$w_{ik} = \frac{\exp\left(-s_k \|\mathbf{r}_{ik}\|^2\right)}{\sum_{j=1}^{K} \exp\left(-s_j \|\mathbf{r}_{ij}\|^2\right)}.$$
The LDE layer concatenates the $K$ residual encoding vectors, generating a fixed-length representation $E = \{\mathbf{e}_1, \ldots, \mathbf{e}_K\}$ (independent of the number of input features $N$). The resulting vector has the same role as the supervector in the GMM supervector approach. Finally, this supervector is projected to a lower dimension to obtain the final embedding through an additional fully connected (FC) layer. This projection has the same role as the total variability matrix of the i-vector system.
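The encoding step can be sketched in plain Python as below. The sketch assumes the soft-assignment weights are a softmax over the negative, smoothing-scaled squared residual norms (the form used by the encoding layer of [27]); the function name and the toy codebook are illustrative, and in the real layer the codewords and smoothing factors are learned.

```python
import math

def lde_encode(features, codewords, smoothing):
    """Return K residual encodings e_k = sum_i w_ik * (x_i - d_k)."""
    K, dim = len(codewords), len(codewords[0])
    encodings = [[0.0] * dim for _ in range(K)]
    for x in features:
        residuals = [[xj - dj for xj, dj in zip(x, d)] for d in codewords]
        logits = [-s * sum(r * r for r in res)
                  for s, res in zip(smoothing, residuals)]
        top = max(logits)                      # subtract max for stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        for k in range(K):
            w = exps[k] / total                # soft-assignment weight w_ik
            for j in range(dim):
                encodings[k][j] += w * residuals[k][j]
    return encodings

codebook = [[0.0, 0.0], [10.0, 10.0]]          # toy codebook with K = 2
feats = [[1.0, 0.0], [0.0, 1.0]]
enc_short = lde_encode(feats, codebook, [1.0, 1.0])
enc_long = lde_encode(feats * 10, codebook, [1.0, 1.0])
# K vectors of dimension 2 in both cases, regardless of the input length.
print(len(enc_short), len(enc_long))
```

Concatenating the $K$ returned vectors gives the fixed-length supervector that the FC layer then projects to the final embedding.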
3.2 $L_2$-constraint based deep length normalization
Cai et al. [20] applied an $L_2$-constraint [28] to the speaker embedding during training. As shown in Figure 1, they added an $L_2$-normalization layer followed by a scale layer to constrain the speaker embedding to lie on a hypersphere of a fixed radius.
This module is added just after the penultimate layer of the network, which is the pooling layer. The $L_2$-normalization layer normalizes the input speaker embedding to a unit vector. The scale layer scales the unit-length embedding vector to a fixed radius given by the parameter $\alpha$. They showed that this simple step in the training pipeline boosts the performance of speaker verification systems.
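The two layers amount to the following operation on a single embedding. This is a minimal sketch with a hypothetical function name; `alpha` stands for the radius parameter of the scale layer, which may be fixed or learned.

```python
import math

def l2_constrain(embedding, alpha):
    """L2-normalize the embedding, then scale it to lie at radius alpha."""
    norm = math.sqrt(sum(v * v for v in embedding))
    return [alpha * v / norm for v in embedding]

constrained = l2_constrain([3.0, 4.0], alpha=12.0)
# Every output has norm alpha, whatever the norm of the input was.
print(math.sqrt(sum(v * v for v in constrained)))
```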
4 Proposed approaches
4.1 Spatial pyramid encoding layer
Figure 2 shows the proposed pooling layer, called the spatial pyramid encoding (SPE) layer. First, the 34-layer ResNet, which has been widely used in previous studies [13, 18, 20, 29], is used to extract frame-level features from utterances. The architecture is described in Table 1. The ResNet takes log Mel-filterbank (Fbank) features of size $64 \times T$ (64 filterbank dimensions by $T$ frames) and outputs frame-level feature maps with 256 channels. The resulting feature maps are fed into the SPE layer and then aggregated into a single, utterance-level speaker representation.
The SPE method includes three steps. In the first step, the input feature maps are divided into increasingly finer sub-regions along the time axis, forming a pyramid of sub-feature maps. This operation is called the spatial pyramid division (SPD). In this work, we apply a pyramid with two levels, with 1 and 4 bins along the time axis (5 bins in total). Subsequently, a convolutional layer is applied to each bin, reducing the number of channels from 256 to 64. After that, we extract speaker embeddings from each bin through the LDE layer with 64 codewords, followed by $L_2$-normalization and an FC layer. This FC layer reduces the dimension of the embeddings from 4,096 (= 64 × 64) to 256. Here, the LDE layer is shared across all bins. Finally, all the local embeddings are concatenated and passed through an FC layer with 256 neurons to form the final speaker embedding.
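The spatial pyramid division and the dimension bookkeeping of the SPE layer can be sketched as follows. The bin counts per level, `levels=(1, 4)`, are an assumption chosen to give the 5 bins stated above, and the function and constant names are illustrative only.

```python
def spatial_pyramid_division(num_frames, levels=(1, 4)):
    """Return (start, end) frame ranges for every bin of every pyramid level."""
    bins = []
    for n_bins in levels:
        for b in range(n_bins):
            start = b * num_frames // n_bins
            end = (b + 1) * num_frames // n_bins
            bins.append((start, end))
    return bins

CODEWORDS, CHANNELS, LOCAL_DIM = 64, 64, 256
SUPERVECTOR_DIM = CODEWORDS * CHANNELS          # LDE output size per bin

bins = spatial_pyramid_division(100)
print(bins)
# Input size of the last FC layer: one 256-dim local embedding per bin.
print(SUPERVECTOR_DIM, len(bins) * LOCAL_DIM)
```

Because the bin boundaries are computed from the number of frames, the number of bins (and hence the size of the concatenated vector) stays the same for any utterance length.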
The SPE layer can be viewed as a combination of the LDE layer and spatial pyramid pooling (SPP) [17] layer. SPP (also known as spatial pyramid matching or SPM [30]), as an extension of the bag-of-words (BoW) model [31], has been widely used in the computer vision community. It partitions an image into several segments in different scales, then computes the BoW histograms [30] or GMM supervectors [32] of local features in each segment. The resulting vectors for all the segments are concatenated to form a high dimensional vector representation of the image. SPP enables us to incorporate the spatial information of feature vectors. He et al. [17] proposed SPP-net in which the SPP layer is used to replace the last pooling layer of the convolutional neural network (CNN). Later, the SPP layer was applied to speaker verification tasks [9]. In the SPP layer, the last convolutional feature maps are divided into sub-regions, and then average pooling is applied to each sub-region.
The proposed SPE layer replaces the simple average pooling operation of the SPP layer with the LDE operation, which was found to perform better for the speaker verification task in [18]. Therefore, the SPE layer can be seen as an extension of the SPP layer. At the same time, we can also view the SPE layer as an extension of the LDE layer. The descriptive power of the LDE layer is limited because it discards the temporal information of local CNN features. This motivates us to combine temporal information with the LDE layer. The SPE layer enhances the LDE layer by taking the temporal information into consideration at both local and global scales.
4.2 Ring loss-based deep length normalization
The $L_2$-constraint based deep length normalization explained in Section 3.2 applies the norm constraint right before the softmax loss. However, according to [22], such a direct approach through a hard normalization operation results in a non-convex formulation. This creates local minima generated by the loss function itself and makes optimization difficult. Because the network optimization is already non-convex, it is important to preserve convexity in the loss function for more effective minimization. To deal with this issue, we apply ring loss [22], which normalizes deep speaker embeddings through a convex augmentation of the primary loss function (such as softmax loss [23] or A-softmax loss [26]). To the best of our knowledge, this is the first work to apply ring loss to speaker verification systems. Ring loss is defined as
$$L_R = \frac{1}{2M} \sum_{i=1}^{M} \left( \|\mathcal{F}(\mathbf{x}_i)\|_2 - R \right)^2,$$
where $\mathcal{F}(\mathbf{x}_i)$ is the speaker embedding for the sample $\mathbf{x}_i$. Here, $R$ is the target norm value, which is learned during training, $M$ is the batch size, and $\frac{1}{M}\sum_{i=1}^{M} \|\mathcal{F}(\mathbf{x}_i)\|_2$ is the average $L_2$-norm of the input embedding vectors in each mini-batch. The loss encourages the norm of the embeddings to approach the value $R$ (a learned parameter) rather than explicitly enforcing it through a hard normalization operation as in the $L_2$-constraint based method. The total objective function is formulated as
$$L = L_S + \lambda L_R,$$
where $L_S$ is the primary loss function. The scalar $\lambda$ balances the two loss functions and is the only hyperparameter in ring loss. In this work, the average $L_2$-norm obtained from the first iteration of training is used as the initial value of $R$.
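A minimal numerical sketch of the ring loss and the total objective follows, assuming the loss is half the mean squared gap between each embedding norm and the target norm, weighted by a balancing scalar. The function names are illustrative; in training the target norm is a learned parameter updated by backpropagation, which this forward-only sketch does not show.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def ring_loss(embeddings, target_norm):
    """Half the mean squared gap between embedding norms and the target norm."""
    gaps = [(l2_norm(e) - target_norm) ** 2 for e in embeddings]
    return sum(gaps) / (2 * len(embeddings))

def total_loss(primary_loss, embeddings, target_norm, lam=1.0):
    """Primary loss (e.g. softmax or A-softmax) plus the weighted ring loss."""
    return primary_loss + lam * ring_loss(embeddings, target_norm)

def initial_target_norm(embeddings):
    """Mean L2 norm of a mini-batch, used to initialise the target norm."""
    return sum(l2_norm(e) for e in embeddings) / len(embeddings)

batch = [[3.0, 4.0], [6.0, 8.0]]               # norms 5 and 10
print(initial_target_norm(batch))              # 7.5
print(ring_loss(batch, target_norm=5.0))       # 6.25
```

Unlike the hard normalization of Section 3.2, the embeddings are left untouched in the forward pass; only the penalty pulls their norms toward the target.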
5 Experimental setup
5.1 Datasets
In this paper, we train our models on the VoxCeleb1 dataset [14]. VoxCeleb1 is a large-scale text-independent speaker recognition dataset, which contains over 140,000 utterances from 1,251 distinct celebrities, recorded in real-world conditions. For the speaker verification task, there are a total of 1,211 speakers in the development set, and the remaining 40 speakers are reserved as the test set. For further details, please refer to [14].
We report the equal error rate (EER) and the minimum detection cost function (DCF) [33] at $P_{target}$ = 0.01 and $P_{target}$ = 0.001. Verification trials are scored using cosine distance.
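For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping the decision threshold over the observed scores; the function name and the toy scores are illustrative, and evaluation toolkits typically interpolate between thresholds instead.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: threshold where false accepts and false rejects balance."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy cosine scores: one same-speaker trial scores below one impostor trial.
print(equal_error_rate([0.9, 0.4], [0.6, 0.1]))   # 0.5
```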
5.2 Implementation details
The input acoustic features are 64-dimensional Fbank features with a frame length of 25 ms, which are mean-normalized over a sliding window of up to 3 s. Neither voice activity detection (VAD) nor data augmentation is applied in the systems.
For each training step, an integer $T$ is randomly selected from the interval [300, 500], and the input utterance is cropped or extended to $T$ frames. Thus, the input size of the ResNet-34 model is $64 \times T$, as shown in Table 1. After training, the entire utterance is evaluated at once in the testing stage. The 256-dimensional speaker embeddings are extracted from a pooling layer. When deep length normalization is applied in training, we do not need an additional length normalization step in testing.
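The variable-length training segments can be sketched as below. The text does not say how short utterances are extended, so the wrap-around repetition used here is an assumption, as are the function names; frames are represented by integers purely for illustration.

```python
import random

def crop_or_extend(frames, target_len, rng=random):
    """Return exactly target_len frames: a random crop if the utterance is
    long enough, otherwise wrap-around repetition (assumed extension rule)."""
    n = len(frames)
    if n >= target_len:
        start = rng.randint(0, n - target_len)
        return frames[start:start + target_len]
    return [frames[i % n] for i in range(target_len)]

def sample_training_segment(frames, rng=random):
    """Draw a segment length in [300, 500] and cut the utterance to it."""
    target_len = rng.randint(300, 500)         # inclusive on both ends
    return crop_or_extend(frames, target_len, rng)

utterance = list(range(1000))                  # stand-in for 1000 Fbank frames
segment = sample_training_segment(utterance, random.Random(0))
print(300 <= len(segment) <= 500)
```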
The models are implemented with PyTorch [34] and optimized by stochastic gradient descent with momentum 0.9. The mini-batch size is 64, and the weight decay parameter is 0.0001. We use the same learning rate schedule as in [18] with the initial learning rate of 0.1.
In LDE layers, the number of codewords is 64. We use the angular margin $m$ = 4 for A-softmax loss. The hyperparameter $\lambda$ for ring loss is set to 1.
6 Results
6.1 Comparison of pooling methods
Table 2 compares the performance of different pooling methods. We use the softmax loss with ring loss-based deep length normalization for all cases. As in [18], temporal average pooling (TAP) is essentially the same as global average pooling, which takes the average over all elements in the 2D feature map. 1D-SPE is our proposed SPE layer, in which the SPD is applied along the time axis as explained in Section 4.1.
Both the SPP and LDE layers yield better performance than the simple TAP layer. The 1D-SPP and LDE layers provide relative improvements of 2.6% and 6.3% in EER over the TAP layer, respectively. In both the SPP and SPE layers, the 1D-SPD performs better than the 2D-SPD. The best result (EER = 4.20%, DCF$_{0.01}$ = 0.422, DCF$_{0.001}$ = 0.528) is obtained when the 1D-SPE layer is used. Our proposed SPE layer (1D-SPE) performs better than both the 1D-SPP and LDE layers, achieving relative improvements of 6.7% and 3.0% in EER, respectively.
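The relative improvements quoted above follow directly from the EER values in Table 2, as this short check shows (the helper name is illustrative). The SPP figures correspond to the 1D-SPP row.

```python
def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

eer = {"TAP": 4.62, "LDE": 4.33, "1D-SPP": 4.50, "1D-SPE": 4.20}

print(round(relative_improvement(eer["TAP"], eer["1D-SPP"]), 1))     # 2.6
print(round(relative_improvement(eer["TAP"], eer["LDE"]), 1))        # 6.3
print(round(relative_improvement(eer["1D-SPP"], eer["1D-SPE"]), 1))  # 6.7
print(round(relative_improvement(eer["LDE"], eer["1D-SPE"]), 1))     # 3.0
```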
6.2 Comparison of deep length normalization methods
In Table 3, we compare the performance of different deep length normalization methods. In the second column, we present the target norm value $R$ to which we would like the speaker embeddings to be normalized. In the $L_2$-constraint based method ($L_2$-Cons), $R$ is equal to $\alpha$ defined in Section 3.2. “(F)” denotes that a fixed optimal value is used, and “(L)” denotes that the parameter is learned by the network rather than fixed.
The softmax loss is used in the first four entries, and the A-softmax loss is used in the last four entries. We observe that applying deep length normalization leads to performance improvement. For example, using the softmax loss with the ring loss (SM + Ring) shows a relative improvement of 32.8% in EER over using the softmax loss without the ring loss (SM). Furthermore, the proposed ring loss-based deep length normalization performs better than the $L_2$-constraint based approach. When using the A-softmax loss, the ring loss achieves a relative improvement of 6.0% in EER over the $L_2$-Cons with $\alpha$ = 30. The best result (EER = 4.41%, DCF$_{0.01}$ = 0.451, DCF$_{0.001}$ = 0.559) is obtained when the A-softmax loss function is used with ring loss-based deep length normalization.
6.3 Comparison with recent methods
In Table 4, we compare our proposed system with recently reported SV systems in terms of EER. For fair comparisons, we do not include systems that are trained on a larger dataset such as VoxCeleb2 [37], or that use data augmentation such as [16]. The i-vector + PLDA system [35] uses 2,048 Gaussian components. VGG-M [14] is trained using contrastive loss with the TAP layer. VGG (1D) [35] uses a 1D-CNN instead of a 2D-CNN, and the statistics pooling layer. VGG-13 [36] is trained under the joint supervision of softmax loss and center loss. The ResNet-34 based systems in [18] use the TAP, SAP, and LDE layers, respectively. The ResNet-34 based system in [20] applies $L_2$-constraint based deep length normalization.
The proposed system uses the SPE layer and A-softmax loss with ring loss. We obtain an EER of 4.03%, a DCF$_{0.01}$ of 0.402, and a DCF$_{0.001}$ of 0.492. Our model outperforms all the other systems in Table 4, including the i-vector and other d-vector systems. It yields relative improvements of 25.4% and 8.4% in EER over the i-vector system and ResNet-34 + SAP (which shows the best performance among the baselines), respectively.
7 Conclusions
In this paper, we proposed spatial pyramid encoding to extract d-vectors for TI-SV. This method achieved better results than the LDE and SPP methods. Furthermore, we applied ring loss-based deep length normalization, and it performed better than the existing $L_2$-constraint based one. On the VoxCeleb1 dataset, our system using the SPE layer and ring loss obtained better performance than the state-of-the-art i-vector and d-vector baselines. In the future, we will explore how to automatically divide the feature maps of CNNs in the SPE layer.
8 Acknowledgements
This material is based upon work supported by the Ministry of Trade, Industry and Energy (MOTIE, Korea) under Industrial Technology Innovation Program (No.10063424, Development of distant speech recognition and multi-task dialog processing technologies for in-door conversational robots).
References
- [1] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
- [2] S. Ioffe, “Probabilistic linear discriminant analysis,” in Proceedings of European Conference on Computer Vision (ECCV), 2006, pp. 531–542.
- [3] P. Kenny, “Bayesian speaker verification with heavy tailed priors,” in Proceedings of Odyssey Speaker and Language Recognition Workshop, 2010, p. 14.
- [4] D. Garcia-Romero and C. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Proceedings of Interspeech, 2011, pp. 249–252.
- [5] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam, “Deep neural networks for extracting Baum-Welch statistics for speaker recognition,” in Proceedings of Odyssey Speaker and Language Recognition Workshop, 2014, pp. 293–298.
- [6] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1695–1699.
- [7] E. Variani, X. Lei, E. McDermott, I. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4052–4056.
- [8] Y. Chen, I. Lopez-Moreno, T. N. Sainath, M. Visontai, R. Alvarez, and C. Parada, “Locally-connected and convolutional neural networks for small footprint speaker recognition,” in Proceedings of Interspeech, 2015, pp. 1136–1140.
