VoiceID Loss: Speech Enhancement for Speaker Verification

Suwon Shon; Hao Tang; James Glass

arXiv:1904.03601·eess.AS·July 8, 2019

VoiceID Loss: Speech Enhancement for Speaker Verification

Suwon Shon, Hao Tang, James Glass

PDF

TL;DR

This paper introduces VoiceID loss, a new training method for speech enhancement that improves speaker verification robustness by using feedback from a verification model to generate targeted masks, outperforming traditional loss functions.

Contribution

The paper presents VoiceID loss, a novel loss function that leverages speaker verification feedback to enhance speech signals specifically for verification tasks.

Findings

01

Improved speaker verification accuracy in noisy environments.

02

Enhanced model's ability to ignore noise-dominated spectrogram components.

03

Consistent performance gains on both clean and noisy data.

Abstract

In this paper, we propose VoiceID loss, a novel loss function for training a speech enhancement model to improve the robustness of speaker verification. In contrast to the commonly used loss functions for speech enhancement such as the L2 loss, the VoiceID loss is based on the feedback from a speaker verification model to generate a ratio mask. The generated ratio mask is multiplied pointwise with the original spectrogram to filter out unnecessary components for speaker verification. In the experiments, we observed that the enhancement network, after training with the VoiceID loss, is able to ignore a substantial amount of time-frequency bins, such as those dominated by noise, for verification. The resulting model consistently improves the speaker verification system on both clean and noisy conditions.

Figures14

Click any figure to enlarge with its caption.

Tables3

Table 1. Table 1: Masking network model structure.

\hlineB2 Layer	Filters / Size	Dilation	Context
\hlineB2 conv1	48 / 1 x 7	1x1	1x7
conv2	48 / 7 x 1	1x1	7x7
conv3	48 / 5 x 5	1x1	11x11
conv4	48 / 5 x 5	2 x 1	19x11
conv5	48 / 5 x 5	4 x 1	35x11
conv6	48 / 5 x 5	8 x 1	67x11
conv7	48 / 5 x 5	1 x 1	71x15
conv8	48 / 5 x 5	2 x 2	79x23
conv9	48 / 5 x 5	4 x 4	95x39
conv10	48 / 5 x 5	8 x 8	127x71
conv11	1 / 1 x 1	1 x 1	127x71
\hlineB2

Table 2. Table 2: EER results obtained on the Voxceleb1 test set.

\hlineB2 Verification network

Using original set

𝒟

Using original and augmented set

𝒟

+

𝒟^{𝒩}

Enhancement

-

Proposed

DAE

-

Proposed

DAE

Enhancement network

-

Using

𝒟^{𝒩}

Using

𝒟

+

𝒟^{𝒩}

-

Using

𝒟

+

𝒟^{𝒩}

Using

𝒟

+

𝒟^{𝒩}

Type

SNR

EER

DCF

EER

DCF

EER

DCF

EER

DCF

EER

DCF

EER

DCF

\hlineB2 Original test set

𝒯

7.73

0.608

6.99

0.590

7.73

0.608

7.01

0.592

6.79

0.574

6.93

0.589

Noise

20

10.34

0.761

8.10

0.675

10.02

0.738

8.08

0.659

7.83

0.639

8.28

0.671

15

13.05

0.909

9.32

0.699

11.45

0.833

8.99

0.720

8.69

0.686

8.96

0.761

10

17.71

0.987

11.24

0.770

14.00

0.943

10.36

0.770

9.86

0.747

10.73

0.869

5

24.34

0.999

14.78

0.885

18.01

0.988

12.90

0.851

12.26

0.830

13.51

0.958

0

31.76

1.000

20.82

0.983

23.87

0.998

17.68

0.945

16.56

0.938

18.32

0.994

Music

20

8.97

0.710

7.54

0.666

9.32

0.714

7.73

0.670

7.48

0.635

7.82

0.651

15

10.60

0.764

8.23

0.715

10.27

0.743

8.43

0.695

8.10

0.677

8.42

0.692

10

14.10

0.883

9.72

0.760

11.75

0.808

9.73

0.760

9.13

0.733

9.54

0.728

5

20.37

0.992

13.00

0.819

15.15

0.941

12.28

0.833

11.44

0.818

11.76

0.846

0

29.03

1.000

18.89

0.937

20.41

0.993

17.45

0.935

16.24

0.913

15.96

0.961

Babble

20

12.87

0.837

10.16

0.781

11.34

0.778

9.17

0.725

8.99

0.705

9.55

0.723

15

18.83

0.931

13.50

0.864

14.45

0.881

11.68

0.793

11.25

0.807

12.10

0.801

10

28.78

0.991

21.18

0.944

21.37

0.969

17.38

0.922

16.66

0.926

17.41

0.941

5

38.74

1.000

33.39

0.996

33.14

0.997

28.21

0.992

27.12

0.996

29.19

0.992

0

44.64

1.000

42.20

1.000

43.30

0.999

38.72

1.000

37.96

1.000

41.11

0.999

Reverb

Small room

13.81

0.835

10.02

0.744

13.54

0.831

10.52

0.725

9.94

0.708

11.52

0.814

Large room

13.74

0.825

10.11

0.756

14.09

0.999

10.64

0.724

10.17

0.691

11.47

0.792

Table 3. Table 3: Performance under unseen noise type (musical noise). Both verification and mask network trained noise augmented development set without any musical noise ( 𝒟 𝒩 − ℳ superscript 𝒟 𝒩 ℳ \mathcal{D^{N-M}} ).

Verification network

Using

𝒟

Using

𝒟 + 𝒟^{𝒩 - ℳ}

Unseen

Noise

SNR

Enhancement

EER

DCF

EER

DCF

\hlineB2 20

-

8.97

0.710

8.00

0.702

Proposed

7.94

0.701

7.76

0.691

DAE

9.53

0.729

7.90

0.673

15

-

10.60

0.764

8.66

0.732

Proposed

8.74

0.747

8.49

0.712

DAE

10.49

0.777

8.52

0.732

10

-

14.10

0.883

10.32

0.774

Proposed

10.62

0.811

9.95

0.766

DAE

12.76

0.847

9.93

0.771

5

-

20.37

0.992

13.57

0.872

Proposed

14.19

0.864

12.39

0.865

DAE

16.92

0.953

13.05

0.846

0

-

29.03

1.000

19.73

0.970

Proposed

21.00

0.965

17.28

0.958

DAE

24.01

0.998

18.94

0.955

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

VoiceID Loss: Speech Enhancement for Speaker Verification

Abstract

In this paper, we propose VoiceID loss, a novel loss function for training a speech enhancement model to improve the robustness of speaker verification. In contrast to the commonly used loss functions for speech enhancement such as the L2 loss, the VoiceID loss is based on the feedback from a speaker verification model to generate a ratio mask. The generated ratio mask is multiplied pointwise with the original spectrogram to filter out unnecessary components for speaker verification. In the experiments, we observed that the enhancement network, after training with the VoiceID loss, is able to ignore a substantial amount of time-frequency bins, such as those dominated by noise, for verification. The resulting model consistently improves the speaker verification system on both clean and noisy conditions.

Index Terms: speech enhancement, speaker verification

1 Introduction

By exploiting large quantities of data, especially via data augmentation [1, 2], speaker embedding methods are now able to surpass the conventional i-vector [3] approach. Many variants, largely based on multiclass classification, have been proposed to extract robust embeddings from speech. The freely available speaker recognition dataset, Voxceleb [4, 5], also accelerated speaker verification improvements by having a common benchmark for different approaches to compare to.

While noise robustness is a general and hard problem for many speech processing tasks, there are relatively few studies about the use of speech enhancement for speaker recognition tasks. This is because, rather than having multiple preprocessing steps to remove noise, training speaker recognition systems using a large and diverse dataset is a simple and powerful solution. Speaker recognition systems naturally become robust to noisy environments when trained on a large dataset augmented with real or synthetic noise types. This approach is especially appealing for large, overparameterized neural networks. Besides, the objective of speech enhancement is to improve the speech quality by suppressing noise, and makes no guarantees to downstream tasks such as speaker verification. Even worse, the artifacts and distortions caused by speech enhancement might even deteriorate speaker verification performance [6].

For these reasons, only a few studies have explored speech enhancement for speaker verification [7, 6] and most recent studies are based on the i-vector approach [8, 9, 10]. They used a Denoising Autoencoder (DAE) to generate an enhanced signal from the noisy signal. As shown in Figure 1.(a), the objective of DAE is to minimize the L2 loss between the output of the model and the clean speech. During training, not only noisy-clean pairs but clean-clean pairs are also needed to prevent the DAE from deteriorating the quality of the clean signal. These studies show improvement on the noisy and mismatched conditions, but have marginal gains on the clean and matched conditions. This is expected because the objective of DAE is to generate outputs that are closer to the inputs. Another study [11] avoids deterioration from artifacts and distortions by training an individual speaker verification system using enhanced clean speech for each enhancement approach.

To improve the effects of speech enhancement for speaker verification, we introduce a VoiceID loss that uses the error signal of a speaker verification model to train a speech enhancement model. The overall structure of this network is shown in Figure 1. Rather than minimizing the L2 loss between the output and the clean speech, we pass the enhanced signals to the speaker verification model, and compute the multiclass cross entropy between the output of the speaker model and the ground truth speaker label. The speech enhancement model is updated based on the cross entropy loss. Since the speaker verification system is a Deep Neural Network (DNN), the gradient can be backpropagated end to end.

VoiceFilter [12] is a similar approach to separate the voice of interest. They generate a ratio mask to filter out unwanted speaker’s voice from the mixture of multiple speakers to obtain homogeneous speech of the target speaker. By pairing the training set with different noise types, this system could be also used for speech enhancement. However, to enable this, the system needs a strong prior, a speaker embedding, from the target speaker. Thus, the performance of separation and enhancement would heavily depend on the speaker embedding, which can be unreliable when there is only a small amount of speech from the target speaker. It is also not applicable to unseen speakers.

Another similar approach [13] is introduced for Automatic Speech Recognition (ASR). They have a DNN-based spectral mapper that extracts robust features from noisy speech. The spectral mapper is trained on a fidelity loss and a mimic loss. The fidelity loss is an L2 loss between the output of the spectral mapper on the clean speech and the output on the noisy speech. The mimic loss, however, is an L2 loss between the posterior probabilities of senones on the clean speech and the noisy speech. Ideally, the mimic loss should be an error between the ground truth senone class and the posterior probabilities of senones on the noisy speech. However, it is common that the ground truth alignments on the speech are not accessible, so they use the posterior probabilities on the clean speech as the ground truth. For this reason, the mimic loss might not be sufficient to minimize the word error rate on the noisy speech, and the fidelity loss is required to compensate the mismatch. This is also shown in their empirical results, where the best scale between losses is achieved with 90% of fidelity loss and only 10% of mimic loss.

In this paper, we only use the feedback from the verification network, i.e., the VoiceID loss. In this way, the enhancement network has the flexibility to find the important time-frequency bins for speaker verification, though the quality of speech might not be improved. From the experiment, we observed this the VoiceID loss is able to remove the noisy bins to improve the speaker verification performance on both noisy and clean condition, even for the unseen type of noise. We describe a system architecture and experimental result in subsequent sections.

2 Speech enhancement using VoiceID loss

The system architecture is shown in Figure 2. At first, the verification network needs to be trained using a training dataset. After training, the weights of verification network are held fixed, i.e., not updated in the subsequent steps. At the second step, the masking network and the verification network are connected to each other to form a single network. Specifically, the masking network generates a ratio mask from the input spectrogram. Then the mask is multiplied pointwise with the input spectrogram. Finally, the masked spectrogram is fed into the verification network to generate a verification output. The cross entropy loss is computed based on the output and the ground truth speaker label of the input speech.

2.1 Noisy dataset generation

First, we generate a set of data for multiple training and test settings. We use the Voxceleb1 development set ( $\mathcal{D}$ ) to train the networks and the test set ( $\mathcal{T}$ ) to validate the networks. Since the dataset is collected from YouTube, the dataset is moderately noisy, but we regard the original set as the clean set. We use the noise recordings from MUSAN [14] to generate corrupted versions of the Voxceleb1 development set ( $\mathcal{D}^{\mathcal{N}}$ ) and the test set ( $\mathcal{T}^{\mathcal{N}}$ ). Specifically, we divide the MUSAN dataset into two disjoint sets, each of which are used to augment the development and test set of Voxceleb1. We make sure we have the same types of noise in both the development and the test set, and the noise samples used to augment the test set are not seen in the development set. MUSAN consists of 4 categories of noise types: noise, music, babble, and reverberation. (The noise category contains multiple types of stationary and non-stationary noises. See [14]). For the development set ( $\mathcal{D}^{\mathcal{N}}$ ), we corrupt each utterance with a Signal to Noise Ratio (SNR) randomly chosen between 0 and 20 in linear scale. For the test set ( $\mathcal{T}^{\mathcal{N}}$ ), we consider all four types of noise and all SNRs (in db) in the set {0, 5, 10, 15, 20}.

The Voxceleb1 development set ( $\mathcal{D}$ ) has a total of 147,935 utterances from 1,211 speakers. The noise augmented dataset ( $\mathcal{D}^{\mathcal{N}}$ ) is generated with same amount as the original development set ( $\mathcal{D}$ ). The Voxceleb1 test set ( $\mathcal{T}$ ) has 18,860 verification pairs for each positive and negative test pair, i.e., a combination of 4,715 utterances from 40 speakers. Each noisy test set ( $\mathcal{T}^{\mathcal{N}}$ ) is generated with same amount as the original test set ( $\mathcal{T}$ ).

Bibliography23

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, and Y. Carmiel, “Deep Neural Network Embeddings for Text-Independent Speaker Verification,” in Interspeech , 2017, pp. 999–1003.
2[2] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-End Text-Dependent Speaker Verification,” in ICASSP , 2016, pp. 5115–5119.
3[3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Trans. on Audio, Speech, and Lang. Process. , vol. 19, no. 4, pp. 788–798, May 2011.
4[4] A. Nagraniy, J. S. Chung, and A. Zisserman, “Vox Celeb: A large-scale speaker identification dataset,” in Interspeech , 2017, pp. 2616–2620.
5[5] J. S. Chung, A. Nagrani, and A. Zisserman, “Vox Celeb 2: Deep Speaker Recognition,” in Interspeech , 2018, pp. 1086–1090.
6[6] S. O. Sadjadi and J. H. Hansen, “Assessment of single-channel speech enhancement techniques for speaker identification under mismatched conditions,” in Interspeech , 2010, pp. 2138–2142.
7[7] J. Ortega-García and J. González-Rodríguez, “Overview of speech enhancement techniques for automatic speaker recognition,” in Proceeding of Fourth International Conference on Spoken Language Processing (ICSLP) , 1996, pp. 929–932.
8[8] O. Plchot, L. Burget, H. Aronowitz, and P. Matějka, “Audio enhancing with DNN autoencoder for speaker recognition,” in ICASSP , vol. 2016-May, 2016, pp. 5090–5094.