Towards Adapting NMF Dictionaries Using Total Variability Modeling for Noise-Robust Acoustic Features
Kunal Dhawan, Colin Vaz, Ruchir Travadi, Shrikanth Narayanan

TL;DR
This paper introduces a novel noise-robust acoustic feature extraction method that adapts NMF dictionaries using Total Variability Modeling, without requiring parallel clean-noisy speech data, and demonstrates competitive performance on noisy speech recognition tasks.
Contribution
The paper presents a new algorithm combining Total Variability Modeling with NMF for utterance-specific noise adaptation, avoiding the need for parallel training data.
Findings
Features perform comparably to baseline features on noisy data.
Proposed features are robust to unseen noise conditions.
Method does not require parallel clean-noisy speech corpus.
| TVM Inputs | NMF Matrices |
|---|---|
| Features | Magnitude spectrogram |
| GMM mean supervector | Vectorized NMF dictionary |
| Cluster posteriors | Normalized activation matrix |
| Condition | MFCC | PNCC | CNMF | Proposed |
|---|---|---|---|---|
| clean | | | | |
| noisy, seen | | | | |
| noisy, unseen | | | | |
| Average | | | | |
Abstract
We propose an algorithm to extract noise-robust acoustic features from noisy speech. We use Total Variability Modeling in combination with Non-negative Matrix Factorization (NMF) to learn a total variability subspace and adapt NMF dictionaries for each utterance. Unlike several other approaches for extracting noise-robust features, our algorithm does not require a training corpus of parallel clean and noisy speech. Furthermore, the proposed features are produced by an utterance-specific transform, allowing the features to be robust to the noise occurring in each utterance. Preliminary results on the Aurora 4 + DEMAND noise corpus show that our proposed features perform comparably to baseline acoustic features, including features calculated from a convolutive NMF (CNMF) model. Moreover, on unseen noises, our proposed features give the most similar word error rate to clean speech compared to the baseline features.
The authors would like to acknowledge the support of NIH grant R01DC007124, NSF grant 1514544, and the IUSSTF-Viterbi program.
Index Terms: Automatic speech recognition, ivectors, NMF, total variability modeling
1 Introduction
Automatic speech recognition (ASR) systems are being increasingly deployed on a wide range of devices for a wide range of applications. Speech offers a natural and efficient way to interact with these devices. Furthermore, speech contains paralinguistic content that devices can use to modify their outputs or behavior. For example, ASR systems in call centers can use the emotion of customers to better serve them or mitigate conflicts [1]. Given the wide usage scenarios, ASR systems need to perform robustly in different acoustic environments, with various background noises and channel conditions, and reliably recognize speech with different dialects and accents. Thus, there has been increasing research in making ASR systems more robust to various real-world conditions. Some techniques researchers have developed include speech denoising [2], feature enhancement [3], feature transformation [4], and acoustic model adaptation [5, 6].
Speech denoising is one straightforward way to make ASR systems robust to background noise. Pre-processing the speech with a noise-removal algorithm reduces the mismatch between the features extracted at test time and the features used to train the ASR system. Common speech denoising algorithms include Wiener filtering and spectral subtraction [7]. Non-negative matrix factorization (NMF) [8, 9] is also widely used for denoising. The drawback of speech denoising is that it usually introduces distortion and artifacts, such as musical noise, and it has been shown to degrade ASR performance [10, 11]. Moreover, the artifacts are usually amplified when the background noise is highly non-stationary or energetic.
To overcome the drawbacks of speech denoising, researchers have investigated extracting acoustic features directly from noisy speech that are robust to noise. Moreno et al. introduced Vector Taylor Series (VTS) features [12], which use the Taylor series expansion of the noisy signal to model the effect of noise and channel characteristics on the speech statistics. Deng et al. proposed the Stereo-based Piecewise Linear Compensation for Environments (SPLICE) algorithm [13] for generating noise-robust features by assuming that clean speech cepstral vectors have a piecewise linear relationship to noisy speech cepstral vectors. Power-Normalized Cepstral Coefficients (PNCC) [14] draw inspiration from human auditory processing for generating noise-robust features, and were shown to reduce word error rates on noisy speech compared to Mel-Frequency Cepstral Coefficients (MFCC) and Relative Spectral Perceptual Linear Prediction (RASTA-PLP) coefficients. Recently, an NMF-based approach was proposed [15], where speech and noise dictionaries are trained on clean and noisy speech, and the coefficients in terms of these dictionaries are used as acoustic features. The authors showed that the NMF-based approach gives better ASR performance than log-mel features or denoising the speech.
In this work, we expand upon the NMF-based approach. On a training set, we learn a universal background model (UBM) dictionary, and then use total variability modeling [16] to learn a subspace for adapting the UBM dictionary to different noise and channel conditions. The advantages of this proposed method over the one in [15] are two-fold:
1. The training set does not require parallel clean and noisy utterances, and
2. The dictionary can be adapted for each utterance at test time, allowing for better modeling of the acoustic conditions in each utterance.
In the following sections, we provide a brief overview of NMF and total variability modeling, followed by our proposed noise-robust acoustic feature algorithm. Section 4 describes our experiments and offers insights into the results, and Section 5 offers our concluding remarks and future directions.
2 Background
2.1 Non-negative Matrix Factorization
NMF decomposes a non-negative matrix $\mathbf{V} \in \mathbb{R}_{+}^{F \times T}$ into the product of a non-negative dictionary $\mathbf{W} \in \mathbb{R}_{+}^{F \times K}$ and a non-negative activation matrix $\mathbf{H} \in \mathbb{R}_{+}^{K \times T}$. Because of the non-negativity constraint, the decomposition is purely additive, and one can think of the dictionary as containing components that are added together by the activation matrix to approximate the input matrix. In the case of speech, the input matrix is typically the magnitude spectrogram, and the dictionary contains spectral “building blocks” required to reconstruct the spectrogram.
For speech processing, it has been shown that the generalized KL divergence gives slightly better performance as the NMF cost function compared to the squared Euclidean distance [17]. Defining $\hat{\mathbf{V}} = \mathbf{W}\mathbf{H}$, the generalized KL divergence between $\mathbf{V}$ and $\hat{\mathbf{V}}$ is
$$D(\mathbf{V} \,\|\, \hat{\mathbf{V}}) = \sum_{i,j} \left( V_{ij} \ln \frac{V_{ij}}{\hat{V}_{ij}} - V_{ij} + \hat{V}_{ij} \right), \tag{1}$$
where $V_{ij}$ refers to the element in row $i$ and column $j$ of $\mathbf{V}$. Lee and Seung derived iterative multiplicative updates for $\mathbf{W}$ and $\mathbf{H}$ to minimize Equation 1 [9]. The advantages of using multiplicative updates over standard gradient descent updates are that no step size parameter is required and that $\mathbf{W}$ and $\mathbf{H}$ are guaranteed to stay non-negative at each iteration if they are initialized with non-negative values. Figure 1 illustrates the NMF decomposition for the case of speech.
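The multiplicative updates mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function names (`nmf_kl`, `kl_divergence`), the random initialization, and the iteration count are assumptions, and the sketch assumes a strictly positive input matrix so that no division by zero occurs.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def kl_divergence(v, v_hat):
    """Generalized KL divergence summed over all entries (0 * ln 0 is taken as 0)."""
    total = 0.0
    for row_v, row_hat in zip(v, v_hat):
        for x, y in zip(row_v, row_hat):
            total += (x * math.log(x / y) if x > 0 else 0.0) - x + y
    return total

def nmf_kl(v, k, n_iter=200, seed=0):
    """Factor V (F x T) into W (F x K) and H (K x T) with multiplicative updates."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Positive initial values keep W and H non-negative in every later iteration.
    w = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_rows)]
    h = [[rng.uniform(0.1, 1.0) for _ in range(n_cols)] for _ in range(k)]
    history = []
    for _ in range(n_iter):
        # W <- W * ((V / WH) H^T) / (1 H^T), all products and divisions element-wise
        v_hat = matmul(w, h)
        ratio = [[x / y for x, y in zip(rv, rh)] for rv, rh in zip(v, v_hat)]
        numer = matmul(ratio, transpose(h))
        h_row_sums = [sum(row) for row in h]
        w = [[w[i][j] * numer[i][j] / h_row_sums[j] for j in range(k)] for i in range(n_rows)]
        # H <- H * (W^T (V / WH)) / (W^T 1)
        v_hat = matmul(w, h)
        ratio = [[x / y for x, y in zip(rv, rh)] for rv, rh in zip(v, v_hat)]
        numer = matmul(transpose(w), ratio)
        w_col_sums = [sum(col) for col in zip(*w)]
        h = [[h[i][j] * numer[i][j] / w_col_sums[i] for j in range(n_cols)] for i in range(k)]
        history.append(kl_divergence(v, matmul(w, h)))
    return w, h, history

# A rank-1 matrix is reproduced by a single dictionary vector.
a, b = [1.0, 2.0, 3.0], [1.0, 0.5, 2.0, 4.0]
v_rank1 = [[x * y for y in b] for x in a]
w1, h1, hist_rank1 = nmf_kl(v_rank1, k=1, n_iter=5)

# A small toy "spectrogram" factored with two dictionary vectors.
v = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.5, 3.0], [4.0, 3.0, 2.0, 1.0]]
w, h, hist = nmf_kl(v, k=2)
```

No step size appears anywhere in the loop, and every update multiplies non-negative quantities, which is why the factors stay non-negative.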
2.2 Total Variability Modeling
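The Total Variability Model (TVM) [16] is a tool that captures distributional differences between sequences of feature vectors within a fixed-dimensional representation. In particular, the assumption is that the feature vectors follow a distribution which has the form of a Gaussian Mixture Model (GMM) where the mean vectors corresponding to different Gaussian components vary across different utterances (in a constrained manner).
Let $\mathcal{X} = \{\mathbf{X}_u\}_{u=1}^{U}$ be the collection of acoustic feature vectors in a dataset comprising $U$ utterances, where $\mathbf{X}_u = \{\mathbf{x}_{ut}\}_{t=1}^{T_u}$ denotes the feature vector sequence of length $T_u$ from a specific utterance $u$. Let $D$ be the dimensionality of each feature vector: $\mathbf{x}_{ut} \in \mathbb{R}^{D}$.
It is assumed that with every utterance $u$, there is an associated vector $\mathbf{w}_u \in \mathbb{R}^{R}$, known as the ivector for that utterance, such that the conditional distribution of $\mathbf{x}_{ut}$ given $\mathbf{w}_u$ is a GMM with $C$ components and parameters $\{p_c, \boldsymbol{\mu}_{uc}, \boldsymbol{\Sigma}_c\}_{c=1}^{C}$, where $\boldsymbol{\mu}_{uc} \in \mathbb{R}^{D}$ and $\boldsymbol{\Sigma}_c \in \mathbb{R}^{D \times D}$. The prior distribution for $\mathbf{w}_u$ is assumed to be standard normal:
$$\mathbf{w}_u \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \mathbf{x}_{ut} \mid \mathbf{w}_u \sim \sum_{c=1}^{C} p_c \, \mathcal{N}(\boldsymbol{\mu}_{uc}, \boldsymbol{\Sigma}_c). \tag{2}$$
Let $\mathbf{M}_0, \mathbf{M}_u \in \mathbb{R}^{CD}$, known as supervectors, denote vectors consisting of stacked global and utterance-specific component means $\boldsymbol{\mu}_{0c}$ and $\boldsymbol{\mu}_{uc}$ respectively. Then, TVM can be summarized as:
$$\mathbf{M}_u = \mathbf{M}_0 + \mathbf{T}\mathbf{w}_u, \tag{3}$$
where $\mathbf{T} \in \mathbb{R}^{CD \times R}$ is given as
$$\mathbf{T} = \left[\, \mathbf{T}_{1}^{\mathsf{T}} \;\; \dots \;\; \mathbf{T}_{C}^{\mathsf{T}} \,\right]^{\mathsf{T}}.$$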
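As a small numeric illustration of the supervector relation (utterance supervector equals global supervector plus the total variability matrix times the ivector), the sketch below stacks two 2-dimensional component means and shifts them with a 1-dimensional ivector. All values and the helper names (`stack_means`, `adapt_supervector`, `unstack_means`) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def stack_means(means):
    """Stack C mean vectors (each of length D) into one supervector of length C*D."""
    return [x for mean in means for x in mean]

def adapt_supervector(m0, t_matrix, w):
    """Return M_u = M_0 + T w_u, with T stored as C*D rows of length R."""
    return [m + sum(t * wj for t, wj in zip(row, w)) for m, row in zip(m0, t_matrix)]

def unstack_means(supervector, dim):
    """Split a supervector back into component means of length dim."""
    return [supervector[i:i + dim] for i in range(0, len(supervector), dim)]

global_means = [[0.0, 0.0], [1.0, 1.0]]        # C = 2 components, D = 2
m0 = stack_means(global_means)                  # [0.0, 0.0, 1.0, 1.0]
t_matrix = [[1.0], [0.0], [0.0], [2.0]]         # (C*D) x R with R = 1
ivector = [0.5]

m_u = adapt_supervector(m0, t_matrix, ivector)
utterance_means = unstack_means(m_u, dim=2)     # [[0.5, 0.0], [1.0, 2.0]]
```

A single low-dimensional ivector moves every component mean at once, which is what lets the model describe an utterance's overall acoustic condition with few parameters.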
3 Algorithm
In this section, we describe an algorithm that uses TVM to adapt an NMF dictionary to the noise in an input spectrogram. The idea is for the dictionary to capture as much of the noise in the spectrogram as possible so that the activation matrix is not affected by noise. We will use the activation matrix as acoustic features for ASR on noisy speech. In our algorithm, the input features for the total variability model are the magnitude spectrograms, the dictionary vectors play the role of GMM mean vectors, and the column-normalized activation matrix acts as the posteriors of each GMM component. Table 1 summarizes how we fit NMF in the TVM paradigm, and the following subsections describe the algorithm in detail.
3.1 Step 1: Learning UBM Dictionary
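We concatenate the utterances from the training set with various noisy conditions and compute the magnitude spectrogram $\mathbf{V}$. We use NMF to decompose $\mathbf{V}$ into a dictionary $\mathbf{W}_0$ and activation matrix $\mathbf{H}$. We will refer to $\mathbf{W}_0$ as the UBM dictionary as it models the salient spectral components in the presence of all sources of variability. Thus, one can think of the vectors in the UBM dictionary as containing salient points in the feature space.
It has been reported in the literature that incorporating a sparsity constraint on the activation matrix while applying NMF leads to a more expressive dictionary [18, 19]. Thus, we add an $\ell_1$ penalty on the activation matrix to the generalized KL divergence cost function to encourage sparsity:
$$C(\mathbf{W}_0, \mathbf{H}) = D(\mathbf{V} \,\|\, \hat{\mathbf{V}}) + \lambda \sum_{k,t} H_{kt}, \tag{4}$$
where $D(\cdot \,\|\, \cdot)$ is the generalized KL divergence, $\hat{\mathbf{V}} = \mathbf{W}_0\mathbf{H}$, and $\lambda$ controls the level of sparsity of $\mathbf{H}$. To minimize Equation 4, we iteratively update $\mathbf{W}_0$ and $\mathbf{H}$ with the following multiplicative updates:
$$\mathbf{W}_0 \leftarrow \mathbf{W}_0 \otimes \frac{\frac{\mathbf{V}}{\hat{\mathbf{V}}}\,\mathbf{H}^{\mathsf{T}}}{\mathbf{1}\,\mathbf{H}^{\mathsf{T}}}, \qquad \mathbf{H} \leftarrow \mathbf{H} \otimes \frac{\mathbf{W}_0^{\mathsf{T}}\,\frac{\mathbf{V}}{\hat{\mathbf{V}}}}{\mathbf{W}_0^{\mathsf{T}}\,\mathbf{1} + \lambda}, \tag{5}$$
where $\otimes$ means element-wise multiplication, the division is element-wise, and $\mathbf{1}$ is an all-ones matrix of the same size as $\mathbf{V}$.
We stack the columns of $\mathbf{W}_0$ to form the UBM dictionary supervector $\mathbf{M}_0$. This will act as the mean supervector for the rest of the steps.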
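The effect of the sparsity weight can be seen in a single activation update. The sketch below is an illustration under stated assumptions rather than the authors' code: it assumes the standard multiplicative update for KL-divergence NMF with an l1 penalty, uses a toy dictionary whose columns sum to one, and the helper names (`sparse_h_update`, `stack_columns`) are hypothetical. With unit-sum dictionary columns, one update leaves each activation column summing to the corresponding spectrogram column sum divided by (1 + lambda), so a larger weight shrinks the activations.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sparse_h_update(v, w, h, lam):
    """One multiplicative update of H for KL-NMF with an l1 penalty of weight lam."""
    v_hat = matmul(w, h)
    ratio = [[x / y for x, y in zip(rv, rh)] for rv, rh in zip(v, v_hat)]
    w_t = [list(col) for col in zip(*w)]
    numer = matmul(w_t, ratio)                      # K x T
    col_sums = [sum(col) for col in w_t]            # W^T 1: one value per dictionary vector
    return [[h[k][t] * numer[k][t] / (col_sums[k] + lam) for t in range(len(h[0]))]
            for k in range(len(h))]

def stack_columns(w):
    """Stack the columns of W (F x K) into a supervector of length F*K."""
    return [x for col in zip(*w) for x in col]

w0 = [[0.5, 0.2], [0.5, 0.8]]                       # toy dictionary, columns sum to 1
v = [[1.0, 2.0, 3.0], [2.0, 1.0, 1.0]]              # toy magnitude spectrogram
h_init = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

h_plain = sparse_h_update(v, w0, h_init, lam=0.0)
h_sparse = sparse_h_update(v, w0, h_init, lam=1.0)
supervector = stack_columns(w0)                     # [0.5, 0.5, 0.2, 0.8]
```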
3.2 Step 2: Calculation of Sufficient Statistics and Total Variability Matrix
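In this step we calculate the zeroth- and first-order sufficient statistics from each of the training files and use them to estimate the total variability matrix. We assume that the magnitude spectrograms are drawn from a multivariate log-normal distribution, so $\ln \mathbf{V}$ is drawn from a multivariate normal distribution. Thus, we will calculate the statistics for $\ln \mathbf{V}$ and model the total variability subspace of $\ln \mathbf{W}$. Note that all the matrices involved have non-negative entries, so there are no issues when taking the log.
For each utterance $u$ in the training set, we calculate its magnitude spectrogram $\mathbf{V}_u$. Using the UBM dictionary $\mathbf{W}_0$ calculated in Step 1, we find the activation matrix $\mathbf{H}_u$ as:
$$\mathbf{H}_u \leftarrow \mathbf{H}_u \otimes \frac{\mathbf{W}_0^{\mathsf{T}}\,\frac{\mathbf{V}_u}{\hat{\mathbf{V}}_u}}{\mathbf{W}_0^{\mathsf{T}}\,\mathbf{1} + \lambda}, \tag{6}$$
where $\hat{\mathbf{V}}_u = \mathbf{W}_0\mathbf{H}_u$, the multiplication $\otimes$ and the division are element-wise, and $\lambda$ is the sparsity weight from Step 1. Define $\tilde{\mathbf{H}}_u$ as the columns of $\mathbf{H}_u$ normalized to sum to $1$. Then, each column of $\tilde{\mathbf{H}}_u$ represents the probability distribution of each time frame of $\mathbf{V}_u$ being represented by the vectors in the dictionary.
Next, we calculate the zeroth-order statistic $\mathbf{N}_u$, first-order statistic $\mathbf{F}_u$, and centered first-order statistic $\tilde{\mathbf{F}}_u$ for each training utterance as:
$$\mathbf{N}_u = \tilde{\mathbf{H}}_u \mathbf{1}, \qquad \mathbf{F}_u = (\ln \mathbf{V}_u)\, \tilde{\mathbf{H}}_u^{\mathsf{T}}, \qquad \tilde{\mathbf{F}}_u = \mathbf{F}_u - (\ln \mathbf{W}_0)\, \mathrm{diag}(\mathbf{N}_u), \tag{7}$$
where $\mathrm{diag}(\mathbf{N}_u)$ is a diagonal matrix formed by $\mathbf{N}_u$. Notice that $\mathbf{N}_u$ is a vector of size $K$ and $\mathbf{F}_u$ is a matrix of size $F \times K$, where $F$ is the dimension of the features and $K$ is the number of dictionary vectors.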
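The statistics described above can be sketched for one utterance as follows. This is a minimal illustration, assuming that the column-normalized activations act as posteriors, that the statistics are accumulated on the element-wise log of the spectrogram, and that centering subtracts the log of the UBM dictionary weighted by the zeroth-order counts. The function names are hypothetical, and all entries are assumed strictly positive so the logarithm is defined.

```python
import math

def normalize_columns(h):
    """Scale each column of H to sum to 1 so it can act as a posterior distribution."""
    col_sums = [sum(col) for col in zip(*h)]
    return [[x / s for x, s in zip(row, col_sums)] for row in h]

def sufficient_stats(v, w0, h):
    """Return (N, F, F_centered) for one utterance, computed in the log domain."""
    post = normalize_columns(h)                              # K x T posteriors
    log_v = [[math.log(x) for x in row] for row in v]        # F x T
    log_w = [[math.log(x) for x in row] for row in w0]       # F x K
    n = [sum(row) for row in post]                           # zeroth order, length K
    f = [[sum(lv * p for lv, p in zip(lv_row, p_row)) for p_row in post]
         for lv_row in log_v]                                # first order, F x K
    f_centered = [[f[i][k] - log_w[i][k] * n[k] for k in range(len(n))]
                  for i in range(len(f))]
    return n, f, f_centered

# Toy case: every frame is an exact copy of one dictionary vector, so the
# centered first-order statistic vanishes and N counts frames per vector.
w0 = [[1.0, 2.0], [3.0, 4.0]]
v = [[1.0, 2.0, 2.0], [3.0, 4.0, 4.0]]
h = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.5]]
n, f, f_centered = sufficient_stats(v, w0, h)
```

Because each posterior column sums to one, the entries of the zeroth-order statistic always add up to the number of frames in the utterance.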
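We estimate the covariance matrix $\boldsymbol{\Sigma}_k$ as below:
$$\boldsymbol{\Sigma}_k = \frac{1}{\sum_{u} N_{uk}} \sum_{u} \sum_{t} \tilde{H}_{u,kt}\, (\ln \mathbf{v}_{ut} - \ln \mathbf{w}_{0k})(\ln \mathbf{v}_{ut} - \ln \mathbf{w}_{0k})^{\mathsf{T}} \tag{8}$$
for $k = 1, \dots, K$, where $\mathbf{v}_{ut}$ is the $t$-th column of $\mathbf{V}_u$, $\mathbf{w}_{0k}$ is the $k$-th column of the UBM dictionary $\mathbf{W}_0$, $\tilde{H}_{u,kt}$ is the normalized activation of dictionary vector $k$ in frame $t$, and $N_{uk}$ is the $k$-th entry of the zeroth-order statistic $\mathbf{N}_u$. Once we have calculated the sufficient statistics for each input utterance, we use them to estimate the total variability matrix $\mathbf{T}$ by performing iterations of the EM algorithm. In each iteration, the posterior mean and covariance of the ivectors are estimated during the E step given the current estimate of $\mathbf{T}$ as below:
$$\mathbf{L}_u = \mathbf{I} + \sum_{k=1}^{K} N_{uk}\, \mathbf{T}_k^{\mathsf{T}} \boldsymbol{\Sigma}_k^{-1} \mathbf{T}_k, \qquad \mathbb{E}[\mathbf{w}_u] = \mathbf{L}_u^{-1}\, \mathbf{T}^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} \tilde{\mathbf{f}}_u, \qquad \mathrm{Cov}[\mathbf{w}_u] = \mathbf{L}_u^{-1}, \tag{9}$$
where $\tilde{\mathbf{f}}_u$ is a supervector formed by stacking together columns of the matrix $\tilde{\mathbf{F}}_u$ and $\boldsymbol{\Sigma}$ is the block-diagonal covariance matrix formed by $\boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_K$. Then, in the M step, the matrices $\mathbf{T}_k$ are updated as below:
$$\mathbf{T}_k = \left( \sum_{u} \tilde{\mathbf{f}}_{uk}\, \mathbb{E}[\mathbf{w}_u]^{\mathsf{T}} \right) \left( \sum_{u} N_{uk}\, \mathbb{E}[\mathbf{w}_u \mathbf{w}_u^{\mathsf{T}}] \right)^{-1} \tag{10}$$
for $k = 1, \dots, K$, where $\tilde{\mathbf{f}}_{uk}$ is the $k$-th column of $\tilde{\mathbf{F}}_u$ and $\mathbb{E}[\mathbf{w}_u \mathbf{w}_u^{\mathsf{T}}] = \mathrm{Cov}[\mathbf{w}_u] + \mathbb{E}[\mathbf{w}_u]\,\mathbb{E}[\mathbf{w}_u]^{\mathsf{T}}$.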
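The E step can be illustrated with the standard ivector posterior: the posterior precision is the identity plus a count-weighted sum over components, and the posterior mean solves a small linear system. The sketch below is not the authors' code. It assumes diagonal covariance matrices to keep the arithmetic short (the paper does not say whether full or diagonal covariances were used), and the function names and toy numbers are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ivector_posterior_mean(counts, f_centered_cols, t_blocks, variances):
    """Posterior mean and precision of the ivector for one utterance.

    counts[k]          zeroth-order statistic of component k
    f_centered_cols[k] centered first-order statistic of component k (length F)
    t_blocks[k]        block T_k of the total variability matrix (F x R)
    variances[k]       diagonal of Sigma_k (length F); diagonal form is an assumption
    """
    rank = len(t_blocks[0][0])
    precision = [[1.0 if p == q else 0.0 for q in range(rank)] for p in range(rank)]
    rhs = [0.0] * rank
    for n_k, f_k, t_k, var_k in zip(counts, f_centered_cols, t_blocks, variances):
        for i in range(len(t_k)):
            for p in range(rank):
                rhs[p] += t_k[i][p] * f_k[i] / var_k[i]
                for q in range(rank):
                    precision[p][q] += n_k * t_k[i][p] * t_k[i][q] / var_k[i]
    return solve(precision, rhs), precision

# With many frames, the posterior mean approaches the ivector that generated
# the centered statistics (here f_k = N_k * T_k w_true, noise-free).
w_true = [1.0, -0.5]
t_blocks = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 2.0]]]
counts = [1e6, 1e6]
variances = [[1.0, 1.0], [1.0, 1.0]]
f_cols = [[n_k * sum(t * w for t, w in zip(row, w_true)) for row in t_k]
          for n_k, t_k in zip(counts, t_blocks)]
w_est, _ = ivector_posterior_mean(counts, f_cols, t_blocks, variances)
```

With few frames the identity term dominates and the estimate is pulled toward zero, which reflects the standard normal prior on the ivector.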
3.3 Step 3: Extraction of Features
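Given an utterance $u$, we find its magnitude spectrogram $\mathbf{V}_u$ and use this to calculate $\mathbf{H}_u$, $\mathbf{N}_u$, $\mathbf{F}_u$, and $\tilde{\mathbf{F}}_u$ using Equations 6 and 7. Then, given $\mathbf{N}_u$, $\mathbf{F}_u$, $\tilde{\mathbf{F}}_u$, the total variability matrix $\mathbf{T}$, and the covariance $\boldsymbol{\Sigma}$, we obtain the ivector $\mathbf{w}_u$ for this utterance using the posterior mean estimate given in Equation 9.
Thereafter, we calculate the adapted dictionary supervector $\mathbf{M}_u$ (which now models the noise type and microphone factors because this information was captured by $\mathbf{w}_u$) as:
$$\mathbf{M}_u = \mathbf{M}_0 + \mathbf{T}\mathbf{w}_u. \tag{11}$$
Notice that $\mathbf{M}_u$ is a vector, so we reshape it to get a non-negative adapted dictionary $\mathbf{W}_u$. With $\mathbf{W}_u$ fixed, we run the NMF algorithm on $\mathbf{V}_u$ to find the corresponding activation matrix $\hat{\mathbf{H}}_u$. We use $\hat{\mathbf{H}}_u$ as features for training the acoustic model in an ASR system.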
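The feature extraction step can be sketched as two small functions: one that shifts the UBM supervector by the ivector and reshapes it into a dictionary, and one that estimates activations with the dictionary held fixed. Two points here are assumptions and not statements from the text: the supervector is taken to hold column-stacked log-dictionary entries, so an element-wise exponential is used to guarantee a non-negative dictionary, and the activation update omits the sparsity term. Function names and toy values are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def adapted_dictionary(m0, t_matrix, w_u, n_rows):
    """Reshape M_u = M_0 + T w_u (column-stacked) into an F x K dictionary."""
    m_u = [m + sum(t * w for t, w in zip(row, w_u)) for m, row in zip(m0, t_matrix)]
    n_cols = len(m_u) // n_rows
    # Assumption: entries are log-magnitudes, so exp() restores non-negativity.
    return [[math.exp(m_u[k * n_rows + i]) for k in range(n_cols)] for i in range(n_rows)]

def activations(v, w, n_iter=100):
    """Estimate H >= 0 for a fixed dictionary W with KL multiplicative updates."""
    k, n_frames = len(w[0]), len(v[0])
    h = [[1.0] * n_frames for _ in range(k)]
    w_t = [list(col) for col in zip(*w)]
    col_sums = [sum(col) for col in w_t]
    for _ in range(n_iter):
        v_hat = matmul(w, h)
        ratio = [[x / y for x, y in zip(rv, rh)] for rv, rh in zip(v, v_hat)]
        numer = matmul(w_t, ratio)
        h = [[h[i][j] * numer[i][j] / col_sums[i] for j in range(n_frames)]
             for i in range(k)]
    return h

w0 = [[1.0, 2.0], [3.0, 4.0]]                               # toy UBM dictionary
m0 = [math.log(x) for col in zip(*w0) for x in col]         # column-stacked log entries
t_matrix = [[0.1], [0.0], [0.0], [-0.1]]                    # toy total variability matrix
w_adapted = adapted_dictionary(m0, t_matrix, [0.5], n_rows=2)

v = [[3.0, 5.0], [7.0, 11.0]]                               # toy utterance spectrogram
features = activations(v, w_adapted)                        # K x T feature matrix
```

Each column of `features` would then serve as the feature vector for one time frame.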
4 Experiments and Results
We investigated the performance of our algorithm on the clean speech in the Aurora 4 corpus [20] with added noise from the DEMAND dataset [21]. The training set consists of 7138 utterances from the Aurora 4 training set corrupted by one of six different noises (labeled in the DEMAND dataset as “dliving”, “npark”, “omeeting”, “presto”, “straffic”, and “tcar”) at 5–15 dB SNR. The test set consists of 330 utterances from 8 speakers, with each of the utterances corrupted by the same six noises with SNRs ranging from 5–15 dB. Additionally, we created a second test set with “ohallway”, “pstation”, and “spsquare” noises added to test the ASR performance in unseen noise conditions.
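For readers who want to build a similar noisy set, adding noise at a chosen SNR comes down to scaling the noise so that the ratio of speech power to noise power matches the target. The sketch below shows this standard calculation. The paper does not describe its mixing procedure, so the function names, the use of average power over the whole signal, and the synthetic signals are all assumptions.

```python
import math
import random

def power(signal):
    """Average power of a signal given as a list of samples."""
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)], gain

rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]   # stand-in for speech
noise = [rng.uniform(-1.0, 1.0) for _ in range(1600)]                      # stand-in for DEMAND noise

noisy, gain = mix_at_snr(speech, noise, snr_db=10.0)
achieved_snr = 10 * math.log10(power(speech) / power([gain * n for n in noise]))
```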
We compared our proposed features to MFCCs, PNCCs [14], and the noise-robust features proposed in [15] (we will refer to these features as CNMF Features since that algorithm uses convolutive NMF (CNMF) to generate features). For our proposed features, we performed NMF on -dimensional spectrograms with dictionary vectors. We set the dimension of the total variability subspace . To match the parameters for the proposed features, we also used dictionary vectors for the CNMF Features. We extracted 13-dimensional MFCC and PNCC features. For each of the features, we applied a speaker-independent global mean and variance normalization prior to augmenting them with delta and delta-delta, followed by Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT). We input the transformed features into a fully-connected 4-layer neural network, with 1024 hidden nodes per layer. The network uses tanh non-linearities and minimizes the cross-entropy using stochastic gradient descent.
Table 2 shows the word error rates (WER) on the Aurora 4 + DEMAND test set for our proposed features and the three baseline features for clean speech, noisy speech with noise seen during training, and noisy speech with noise not seen during training. We also provide the weighted average WER for the three conditions. From the results, one can see that clean speech has the lowest WER while the performance degrades with noise for all feature sets. At first glance, it seems surprising that the performance on unseen noises is better than on seen noises. But we note that in general the seen noises contain more background speech and non-stationary characteristics compared to the unseen noises, and these characteristics, particularly background speech, make ASR more challenging. Indeed, when we inspected the results, we noticed that performance on “presto” noise (restaurant noise) had more than twice the WER compared to other noises.
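As a reminder of how the metric is computed, WER counts the substitutions, deletions, and insertions needed to turn the recognized word sequence into the reference, divided by the number of reference words. The sketch below uses a word-level edit distance and pools errors over utterances, which weights each utterance by its length. Whether the paper's weighted average uses this exact weighting is an assumption, and the function names and sentences are made up for illustration.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1], len(ref)

def wer(pairs):
    """Pooled WER in percent over (reference, hypothesis) pairs."""
    errors = sum(word_errors(r, h)[0] for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return 100.0 * errors / words

pairs = [
    ("the cat sat", "the cat sat"),
    ("the cat sat on the mat", "the cat sit on mat"),   # one substitution, one deletion
]
pooled_wer = wer(pairs)   # 2 errors over 9 reference words
```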
Unfortunately, we were not able to achieve a lower WER with the proposed features compared to the baseline features. This is most likely due to the fact that the proposed features are computed from a per-utterance transform while MFCC, PNCC, and CNMF features are computed from fixed transforms, so the proposed features are sensitive to the parameters chosen when learning the transform and to how well the total variability matrix captures the sources of variability in the training set. In particular, the subspace dimension is very important because underestimating it leads to poor modeling of the sources of variability, while overestimating it results in capturing extraneous information. Moreover, we believe that the utterances in the training set may not provide enough observations to properly learn the total variability subspace. In fact, our initial experiments were carried out on the standard Aurora 4 dataset, which includes microphone variability in addition to speaker and noise variability. Given the limited amount of training data, having additional variability due to channel conditions resulted in a WER that was much greater for Aurora 4 than for Aurora 4 + DEMAND. (The WER for all feature sets was greater with Aurora 4 compared to Aurora 4 + DEMAND due to the multicondition style training. However, the amount of change in WER was much greater for the proposed features compared to the baseline features.) Therefore, we are confident that a larger training set should result in a WER competitive with the other features as it will allow for better modeling of the total variability subspace.
On the other hand, one can see that the proposed features have the smallest gap in WER between the clean and unseen noise conditions compared to the baseline features. The main motivation behind this work is to generate acoustic features that are robust to acoustic conditions, so this result gives an indication that our algorithm can achieve this goal. This finding alone gives us good reason to improve the overall performance because it would allow ASR systems to perform near clean speech WER without having to retrain the acoustic model for specific acoustic conditions.
5 Conclusion
We proposed an algorithm to calculate noise-robust acoustic features from noisy utterances. The algorithm uses Total Variability Modeling to learn a total variability subspace and adapt a UBM NMF dictionary for each utterance at test time. We use the NMF activation matrix corresponding to the adapted dictionary as the acoustic features. Thus, our proposed features are calculated from per-utterance transforms, which could lead to greater robustness to the specific noise present in each utterance. Moreover, our algorithm, which builds upon the work in [15], does not require a training dataset of parallel clean and noisy speech. While the proposed features did not perform better than baseline features on the Aurora 4 + DEMAND corpus, we note that the WER was more consistent across clean and noisy conditions, in particular the unseen noise condition. This gives an indication that our approach has the potential to perform robustly in different noise conditions, but the overall results suggest that we should explore training with a larger dataset to better learn the total variability subspace.
Going forward, we will first retrain our system on a larger dataset, such as the Librispeech corpus [22]. Also, we will test the performance of the proposed algorithm in the presence of channel variability to more accurately simulate real-world acoustic conditions.
References
- [1] D. Pappas, I. Androutsopoulos, and H. Papageorgiou, “Anger detection in call center dialogues,” in IEEE Int. Conf. Cognitive Infocommunications, Győr, Hungary, 2015, pp. 139–144.
- [2] D. Macho, L. Mauuary, B. Noé, Y. M. Cheng, D. Ealey, D. Jouvet, H. Kelleher, D. Pearce, and F. Saadoun, “Evaluation of a noise-robust DSR front-end on Aurora databases,” in Proc. Int. Conf. Spoken Lang. Process., 2002, pp. 17–20.
- [3] T. Yoshioka and T. Nakatani, “Noise model transfer: novel approach to robustness against nonstationary noise,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 21, no. 10, pp. 2182–2192, Oct. 2013.
- [4] J. Droppo, A. Acero, and L. Deng, “Evaluation of the SPLICE algorithm on the Aurora 2 database,” in Proc. Eurospeech, 2001, pp. 217–220.
- [5] O. Kalinli, M. L. Seltzer, J. Droppo, and A. Acero, “Noise adaptive training for robust automatic speech recognition,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 18, no. 8, pp. 1889–1901, Nov. 2010.
- [6] Y. Wang and M. J. F. Gales, “Speaker and noise factorization for robust speech recognition,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 20, no. 7, pp. 2149–2158, Sep. 2012.
- [7] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoustics, Speech, and Signal Process., vol. 27, no. 2, pp. 113–120, Apr. 1979.
- [8] P. Paatero and U. Tapper, “Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
