Mixed penalization in convolutive nonnegative matrix factorization for blind speech dereverberation
Francisco J. Ibarrola, Leandro E. Di Persia, Ruben D. Spies

TL;DR
This paper introduces a novel convolutive nonnegative matrix factorization approach with mixed penalization to improve blind speech dereverberation, demonstrating significant performance gains over existing methods.
Contribution
It proposes a new method combining two penalizers in convolutive NMF for better speech dereverberation in general conditions.
Findings
Significant improvement over state-of-the-art dereverberation methods.
Effective in restoring speech signals affected by room reverberation.
Evaluated on TIMIT speech artificially reverberated with simulated room impulse responses, at reverberation times of 300 to 750 ms.
Table 1: Model parameters used in the experiments.

| | | win. | window size | win. overlap | max. iter. |
|---|---|---|---|---|---|
| 1 | 15 | Hann | 512 samples | 256 samples | 20 |
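
Table 2: Frequency weighted segmental SNR (fwsSNR; higher is better) by reverberation time.

| Rev. time [ms] | Reverberant signal | Kameoka et al. [16] | Wisdom et al. [12] | Mixed penalization |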
|---|---|---|---|---|
| 300 | 8.102 (1.96) | 7.950 (1.73) | 8.262 (1.53) | 8.658 (1.59) |
| 450 | 4.815 (1.42) | 5.127 (1.36) | 5.771 (1.28) | 6.539 (1.56) |
| 600 | 3.082 (1.20) | 3.358 (1.19) | 4.140 (1.17) | 4.732 (1.43) |
| 750 | 1.998 (1.11) | 2.184 (1.10) | 3.013 (1.12) | 3.440 (1.31) |

Table 3: Cepstral distance (lower is better) by reverberation time.

| Rev. time [ms] | Reverberant signal | Kameoka et al. [16] | Wisdom et al. [12] | Mixed penalization |
|---|---|---|---|---|
| 300 | 3.440 (0.44) | 4.057 (0.45) | 3.908 (0.48) | 3.521 (0.35) |
| 450 | 4.264 (0.44) | 4.636 (0.42) | 4.511 (0.41) | 3.985 (0.39) |
| 600 | 4.716 (0.46) | 5.006 (0.42) | 4.860 (0.40) | 4.370 (0.40) |
| 750 | 5.011 (0.48) | 5.264 (0.43) | 5.089 (0.40) | 4.657 (0.41) |
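
Table 4: Speech-to-reverberation modulation energy ratio (SRMR; higher is better) by reverberation time.

| Rev. time [ms] | Reverberant signal | Kameoka et al. [16] | Wisdom et al. [12] | Mixed penalization |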
|---|---|---|---|---|
| 300 | 4.297 (1.78) | 2.901 (0.92) | 5.269 (2.36) | 5.207 (1.78) |
| 450 | 3.020 (1.15) | 2.173 (0.64) | 3.907 (1.63) | 4.305 (1.44) |
| 600 | 2.378 (0.86) | 1.786 (0.51) | 3.175 (1.27) | 3.698 (1.21) |
| 750 | 2.003 (0.71) | 1.551 (0.44) | 2.727 (1.07) | 3.301 (1.05) |
Francisco J. Ibarrola Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional, sinc(i), FICH-UNL/CONICET, Argentina. Ciudad Universitaria, CC 217, Ruta Nac. 168, km 472.4, (3000) Santa Fe, Argentina.
Leandro E. Di Persia ∗
Ruben D. Spies Instituto de Matemática Aplicada del Litoral, IMAL, CONICET-UNL, Centro Científico Tecnológico CONICET Santa Fe, Colectora Ruta Nac. 168, km 472, Paraje “El Pozo”, (3000), Santa Fe, Argentina and Departamento de Matemática, Facultad de Ingeniería Química, Universidad Nacional del Litoral, Santa Fe, Argentina.
Abstract
When a signal is recorded in an enclosed room, it typically gets affected by reverberation. This degradation represents a problem when dealing with audio signals, particularly in the field of speech signal processing, such as automatic speech recognition. Although there are some approaches to deal with this issue that are quite satisfactory under certain conditions, constructing a method that works well in a general context still poses a significant challenge. In this article, we propose a method based on convolutive nonnegative matrix factorization that mixes two penalizers in order to impose certain characteristics over the time-frequency components of the restored signal and the reverberant components. An algorithm for implementing the method is described and tested. Comparisons of the results against those obtained with state of the art methods are presented, showing significant improvement.
Keywords: signal processing, dereverberation, regularization.
1 Introduction
In recent years, many technological developments have attracted attention towards human-machine interaction. Since the most natural and easiest way of human communication is through speech, much research effort has been put into achieving the same natural interaction with machines. This effort has already generated many advances in a wide variety of fields such as automatic speech recognition ([1]), automatic translation systems ([2]) and control of remote devices through voice ([3]), to name only a few. A significant amount of work has recently been devoted to achieving robustness in speech recognition ([4]), resulting in several advances in the areas of speech enhancement ([1], [5]), multiple-source separation ([6], [7]), and particularly in dereverberation techniques ([8]), which constitute the topic of this work.
When recorded in enclosed rooms, audio signals will most certainly be affected by reverberant components due to reflections of the sound waves in the walls, ceiling, floor or furniture. This can severely degrade the characteristics of the recorded signal ([9]), generating difficult problems for its processing, particularly when required for certain speech applications ([10]). The goal of any dereverberation technique is to remove or to attenuate the reverberant components in order to obtain a cleaner signal. The dereverberation problem is called “blind” when the available data consists only of the reverberant signal itself, and this is the problem we shall deal with in this work.
Depending on the problem, our observation might consist of a single or multi-channel signal. That is, we might have a signal recorded by one or more microphones. For the latter case, quite a few methods exist that work relatively well ([11], [12]).
For the single-channel case, we may distinguish between supervised and unsupervised approaches. The first kind refers to those that begin with a training stage that serves to learn some characteristics of the reverberation conditions, while the second kind alludes to those methods that can be implemented directly over the reverberant signal. Some supervised methods ([13], [14], [15]) appear to perform somewhat better than unsupervised ones, but they have the disadvantage of needing learning data corresponding to the specific room conditions, microphone and source locations, and a previous process that might take a significant amount of time.
In the context of unsupervised blind dereverberation, although some recently proposed methods ([12], [16]) work reasonably well, there is still much room for improvement. Our work is based on a convolutive non-negative matrix factorization (NMF) reverberation model, as proposed by Kameoka et al. ([16]), along with a Bayesian approach for building a generalized functional that mixes two types of penalizers over the elements of the representation model. Mixed penalization approaches have recently been applied successfully by several authors in many areas, mainly in signal and image processing applications ([17], [18], [19], [20], [21]). These techniques have been shown to produce good results in terms of enhancing certain desirable characteristics of the solutions while precluding unwanted ones.
1.1 A Reverberation Model
Let , with support in , be the functions associated to the clean and reverberant signals, respectively. As is customary, we shall assume that the reverberation process is well represented by a linear time-invariant (LTI) system. Thus, the reverberation model can be written as
[TABLE]
where is the room impulse response (RIR) signal, and “” denotes convolution. This LTI hypothesis implies we are assuming the source and microphone positions to be static, and the energy of the signal to be low enough for the effect of the non-linear components to be relatively insignificant.
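The LTI model can be illustrated numerically. The following is a minimal sketch, not the authors' setup: the function names `synthetic_rir` and `reverberate`, the noise-based toy impulse response and the chosen reverberation time are all assumptions made for illustration (the experiments in this paper use simulated room impulse responses instead).

```python
import numpy as np

def synthetic_rir(fs, t60, length_s=None, seed=0):
    """Toy RIR (assumption): white noise shaped by an exponential decay of 60 dB at t60."""
    rng = np.random.default_rng(seed)
    n = int(fs * (length_s if length_s is not None else t60))
    t = np.arange(n) / fs
    decay = np.exp(-3.0 * np.log(10.0) * t / t60)  # amplitude envelope: 1e-3 at t = t60
    h = rng.standard_normal(n) * decay
    h[0] = 1.0                                     # direct path
    return h

def reverberate(s, h):
    """LTI reverberation: full linear convolution of the clean signal with the RIR."""
    return np.convolve(s, h)

fs = 16000
s = np.random.default_rng(1).standard_normal(fs)   # 1 s of noise standing in for speech
h = synthetic_rir(fs, t60=0.45)
x = reverberate(s, h)                              # reverberant observation
```

Because the model is linear and time-invariant, scaling or delaying the clean signal scales or delays the reverberant one in the same way, which is what the static source and microphone hypothesis amounts to.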
When dealing with sound signals (particularly speech signals), it is often convenient to work with the associated spectrograms rather than the signals themselves. Thus, we make use of the short time Fourier transform (STFT), defined as
[TABLE]
where is a compactly supported, even function such that . This function is called window.
In practice, we work with discretized versions of the signals involved ( and ). With this in mind, we shall define the discrete STFT as
[TABLE]
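A discrete STFT of this kind can be sketched as follows. This is an illustrative implementation under assumptions: the function name `stft`, the zero padding of the last frame and the use of a one-sided FFT are choices made here, while the Hann window, the 512-sample window size and the 256-sample hop follow the parameter table of this paper.

```python
import numpy as np

def stft(x, win_len=512, hop=256):
    """Discrete STFT with a Hann window; returns an array of shape (freq bins, frames)."""
    window = np.hanning(win_len)
    # Enough frames for the union of the window supports to cover the whole signal.
    n_frames = 1 + max(0, (len(x) - win_len + hop - 1) // hop)
    padded = np.zeros((n_frames - 1) * hop + win_len)
    padded[:len(x)] = x
    frames = np.stack([padded[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft keeps the bins from 0 up to half the sampling frequency.
    return np.fft.rfft(frames, axis=1).T

fs = 16000
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 1000.0 * t))   # 1 kHz tone, 1 s long
```

With these settings a 1 kHz tone sampled at 16 kHz falls on frequency bin 1000 * 512 / 16000 = 32.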
Denoting the STFTs of and by and , respectively, a discretized approximation of the STFT model associated to (1) is given by
[TABLE]
where is a discretized time variable that corresponds to window location, denotes the frequency subband and is a parameter of the model associated to the expected maximum duration of the reverberation phenomenon. The model is built as in [22], the approximation being due to the use of band-to-band filters only. Later on, the values of will be chosen in such a way that the union of the windows’ supports contains the support of the observed signal, and the values of in such a way that they cover the whole frequency spectrum, up to half the sampling frequency.
Now, let us write . It is well known ([23]) that the phase angles are highly sensitive with respect to mild variations on the reverberation conditions. To overcome the problems derived from this, we shall proceed (see [16]) treating the variables as i.i.d. random variables with uniform distribution in . Denoting the complex conjugate by “∗” and the Kronecker delta by , the expected value of is given by
[TABLE]
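The reason the cross terms vanish can be written out in one line. The symbols below are assumptions introduced for this sketch: $\varphi_a$ and $\varphi_b$ stand for two of the i.i.d. phases, uniformly distributed on an interval $[\theta, \theta + 2\pi)$.

```latex
\mathbb{E}\!\left[e^{j(\varphi_a-\varphi_b)}\right]
=\begin{cases}
1, & a=b,\\[2pt]
\mathbb{E}\!\left[e^{j\varphi_a}\right]\,\overline{\mathbb{E}\!\left[e^{j\varphi_b}\right]}=0, & a\neq b,
\end{cases}
\qquad\text{since}\qquad
\frac{1}{2\pi}\int_{\theta}^{\theta+2\pi}e^{j\varphi}\,d\varphi=0
\quad\text{for every }\theta .
```

Independence is what allows the expectation of the product to factor when $a \neq b$, and the integral is zero regardless of where the interval starts.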
Note that the interval choice for is arbitrary, since this result holds for any interval of length 2π. Finally, let us define , and . Then, our model reads
[TABLE]
and the square magnitude of the observed spectrogram components can be written as
[TABLE]
where denotes the representation error. As shown in [16], this model is equivalent to a convolutive NMF ([24]) with diagonal basis. In the next section, we derive a cost function in order to find an appropriate convolutive representation that allows us to isolate the components .
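The structure of this representation, one independent convolution along time per frequency band, can be sketched in a few lines. The names `convolutive_model`, `H` and `S`, and the array layout (bands along rows, time along columns) are assumptions for illustration; terms that would need samples before the first frame are simply dropped.

```python
import numpy as np

def convolutive_model(H, S):
    """Band-wise convolution: out[k, n] = sum_m H[k, m] * S[k, n - m].

    H has shape (bands, taps) and S has shape (bands, frames); both are
    nonnegative. Terms with n < m are omitted.
    """
    K, N = S.shape
    out = np.zeros((K, N))
    for m in range(min(H.shape[1], N)):
        out[:, m:] += H[:, m:m + 1] * S[:, :N - m]
    return out

rng = np.random.default_rng(0)
H = rng.random((3, 4))
S = rng.random((3, 10))
P = convolutive_model(H, S)
```

No band interacts with any other, which is what a diagonal basis means in the convolutive NMF setting.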
2 A Bayesian approach
In the following, we will use a Bayesian approach to derive a cost function which we will then minimize in order to obtain our regularized solution. Let us begin by assuming, for every , are independent random variables, also independent with respect to . Also, let us denote by and the non-negative matrices whose -th elements are and , respectively.
As is customary ([16]), for the representation error, we assume , where is an unknown parameter, and the variables are uncorrelated with respect to . Hence, it follows from (4) that the conditional distribution of given and (i.e. the likelihood distribution) is given by
[TABLE]
Let us now turn our attention to . Figure 1 depicts the -spectrograms for a clean signal and its reverberant version. As can be observed, while the spectrogram of the clean signal is somewhat sparse, the one corresponding to the reverberant signal presents a smoother or more diffuse structure. The presence of discontinuities in the spectrogram of the clean signal can be favored by assuming follows a generalized Gaussian distribution ([25]). Namely,
[TABLE]
where is a prescribed parameter and is unknown.
With regard to , although no general conditions are expected on its individual components, we do expect its first order time differences to exhibit a certain degree of regularity (see Figures 2 and 3). In fact, if windows are set close enough relative to the duration of the reverberation phenomenon, then consecutive time components of will capture overlapped information, which along with the exponential decay characteristic of the RIR ([26]) accounts for a somewhat smooth structure. Therefore, we define the time differences matrix , with components . The regularity of these variations is accounted for by assuming follows a normal distribution:
[TABLE]
Using Bayes’ theorem, the a posteriori joint distribution of and conditioned to satisfies
[TABLE]
2.1 Mixed penalization
Our goal is to find and that are representative of the a posteriori distribution (5). Although the immediate instinct might be to compute the expected value, there are quite a few other ways to proceed, with different degrees of reliability and complexity. In light of the assumed distributions and the high dimensionality of the problem, the maximum a posteriori (MAP) estimator is a reasonable choice in this case. Note that maximizing (5) is tantamount to minimizing . If we denote by , and the (transposed) rows of and , and define in such a way that , then
[TABLE]
where is a constant which does not depend on nor .
Finally, the latter equation leads to the cost function
[TABLE]
which shall be minimized to find our regularized solution. In this context, can be thought of as penalization parameters weighting both penalizers relative to the fidelity term, whereas the exponent is a tuning parameter. It is worth pointing out that small values of will promote sparsity, whereas values close to will promote smoothness. Since there is a clear scale indeterminacy in the representation (3), we impose the (somewhat arbitrary) additional constraint , which means that the maximum values shall remain equal for every frequency.
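A cost of this type, with a fidelity term plus one penalizer on each factor, can be sketched as follows. This is an assumed form, not a transcription of the paper's expression: a squared-error fidelity term (as suggested by the normal model for the representation error), a sum of p-th powers of the entries of the clean spectrogram, and a squared norm of the first-order time differences of the reverberation factor. The names `mixed_cost`, `lam_s` and `lam_h` are hypothetical, and any constant factors are omitted.

```python
import numpy as np

def mixed_cost(X, H, S, lam_s, lam_h, p=1.0):
    """Fidelity + sparsity penalty on S + smoothness penalty on H (assumed form).

    X, S: (bands, frames); H: (bands, taps); lam_s, lam_h: one weight per band.
    """
    K, N = S.shape
    model = np.zeros((K, N))
    for m in range(min(H.shape[1], N)):            # band-wise convolution of H and S
        model[:, m:] += H[:, m:m + 1] * S[:, :N - m]
    fidelity = np.sum((X - model) ** 2)
    sparsity = np.sum(lam_s[:, None] * np.abs(S) ** p)
    smoothness = np.sum(lam_h * np.sum(np.diff(H, axis=1) ** 2, axis=1))
    return fidelity + sparsity + smoothness

lam = np.array([1.0, 2.0])
c_smooth = mixed_cost(np.zeros((2, 5)), np.array([[1.0, 2.0, 4.0], [0.0, 0.0, 0.0]]),
                      np.zeros((2, 5)), lam, lam)
c_sparse = mixed_cost(np.zeros((2, 5)), np.zeros((2, 3)), np.ones((2, 5)), lam, lam, p=1.0)
```

In the first call only the smoothness term is active (differences 1 and 2 in the first band, so 1 + 4 = 5); in the second only the sparsity term is (five unit entries per band, weighted by 1 and 2, so 15).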
2.2 Regularization parameters
As mentioned before, the parameters weight the penalizers against the fidelity term. In this sense, the optimal weights of these regularization parameters might vary as a function of the frequency subband, and hence their proposed dependency on . Since searching blindly for parameters is non-viable in practice, we quantify this dependency by defining and (note that the relation between and is already contemplated in the constraint that intends to avoid scale indeterminacy). This means we only need to look for two parameters () and then multiply by the energy of the signal associated to each row of .
Next, we present an algorithm for approximating matrices and minimizing .
3 Updating rules
We shall build an iterative algorithm following the idea in [16], which is based on the auxiliary function technique.
Let and . Then, is called an auxiliary function for if
[TABLE]
Let be arbitrary, and let
[TABLE]
With this definition, it can be shown ([27]) that the sequence is non-increasing. We intend to use this property as a tool for alternatively updating the matrices and . Let us begin by fixing , where is an arbitrary matrix. We will show that
[TABLE]
is an auxiliary function for (as defined in (6)) with respect to . From this point on, we denote by . The equality condition in (7) is rather straightforward. In fact,
[TABLE]
To prove condition in (7) we begin by defining
[TABLE]
and such that . With these definitions, we can write
[TABLE]
and
[TABLE]
Hence, to prove that it is sufficient to show that and . In fact,
[TABLE]
To prove that , we begin by noting that . Then, the first order necessary condition for yields
[TABLE]
meaning the only point at which the derivative of equals zero is at . Furthermore, , meaning that is the global minimum of . This yields
[TABLE]
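Arguments of this kind typically rest on a quadratic upper bound for a p-th power. The sketch below checks numerically a standard bound of that type, valid for 0 < p ≤ 2 and a positive reference point; it is offered as an assumption about the flavor of the inequality and may differ from the exact expression used in the derivation. The name `power_majorizer` is hypothetical.

```python
import numpy as np

def power_majorizer(s, s_ref, p):
    """Quadratic upper bound of s**p on s >= 0 (0 < p <= 2), tight at s = s_ref > 0."""
    return 0.5 * p * s_ref ** (p - 2) * s ** 2 + (1 - 0.5 * p) * s_ref ** p

s = np.linspace(0.0, 5.0, 501)
s_ref = 1.3
# Smallest gap between the bound and s**p over the grid, for several exponents.
gaps = {p: float(np.min(power_majorizer(s, s_ref, p) - s ** p))
        for p in (0.5, 1.0, 1.5, 2.0)}
```

The bound follows from the concavity of u ↦ u^(p/2) applied at u = s²: the tangent line at s_ref² lies above the curve and touches it only there, which mirrors the single stationary point found above.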
In an analogous way, it can be shown that if we let be fixed, where is an arbitrary matrix, then
[TABLE]
is an auxiliary function for with respect to .
Having defined auxiliary functions, we will use the updating rule derived from (8) to build an algorithm for iteratively approaching matrices and minimizing . Notice this requires minimizing and with respect to the updating variables, but since is quadratic with respect to and is quadratic with respect to , we can simply use the first order necessary conditions in both cases. From this point on, in the context of the iterative updating process, and will refer not to arbitrary nonnegative matrices, but to those estimations of and obtained in the immediately previous step.
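The monotonicity that justifies this alternating scheme follows from a two-step chain. The notation here is assumed for the sketch: $f$ is the function being minimized, $g(w, w')$ an auxiliary function satisfying $g(w, w) = f(w)$ and $g(w, w') \ge f(w)$, and $w^{(t+1)}$ a minimizer of $g(\cdot, w^{(t)})$.

```latex
f\!\left(w^{(t+1)}\right)
\;\le\; g\!\left(w^{(t+1)}, w^{(t)}\right)
\;\le\; g\!\left(w^{(t)}, w^{(t)}\right)
\;=\; f\!\left(w^{(t)}\right).
```

The first inequality is the majorization property, the second holds because $w^{(t+1)}$ minimizes the auxiliary function in its first argument, and the final equality is the tightness condition.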
3.1 Updating rule for S
Firstly, we shall derive an updating rule for . That is, we wish to minimize w.r.t. . The first order necessary condition on yields
[TABLE]
which easily leads to the multiplicative updating rule
[TABLE]
In order to avoid the aforementioned scale indeterminacy, every updating step is to be followed by scaling so that its norm coincides with that of the observation .
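One multiplicative rule of this kind can be derived under stated assumptions: a squared-error fidelity term, a Jensen-type auxiliary function for the convolution, and a quadratic bound on the p-th power penalty. The sketch below implements that rule; it is not claimed to match the paper's expression term by term, the names `conv_model`, `update_S` and `lam_s` are hypothetical, and the rescaling step described above is left out.

```python
import numpy as np

def conv_model(H, S):
    """Band-wise convolution: out[k, n] = sum_m H[k, m] * S[k, n - m]."""
    K, N = S.shape
    out = np.zeros((K, N))
    for m in range(min(H.shape[1], N)):
        out[:, m:] += H[:, m:m + 1] * S[:, :N - m]
    return out

def update_S(X, H, S, lam_s, p=1.0, eps=1e-12):
    """One multiplicative step on S with H fixed (assumed cost form)."""
    K, N = S.shape
    model = conv_model(H, S)
    num = np.zeros((K, N))
    den = np.zeros((K, N))
    for m in range(min(H.shape[1], N)):
        num[:, :N - m] += H[:, m:m + 1] * X[:, m:]       # correlation of H with the observation
        den[:, :N - m] += H[:, m:m + 1] * model[:, m:]   # correlation of H with the current model
    den += 0.5 * p * lam_s[:, None] * np.maximum(S, eps) ** (p - 1)  # penalty contribution
    return S * num / (den + eps)

rng = np.random.default_rng(0)
K, N, M = 4, 60, 5
H = rng.random((K, M)) + 0.1
X = conv_model(H, rng.random((K, N)))        # synthetic observation
S0 = rng.random((K, N)) + 0.1                # strictly positive starting point
lam = np.full(K, 0.3)
S1 = update_S(X, H, S0, lam, p=1.0)
```

Since numerator and denominator are nonnegative, the update keeps `S` nonnegative without any explicit projection, which is the usual appeal of multiplicative rules.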
3.2 Updating rule for H
In order to find an updating rule for , we shall write as a function of the transposed rows . We begin by noting
[TABLE]
Next, we define the diagonal matrices , whose diagonal elements are and , and the vector with components . With these definitions, we can write
[TABLE]
Now, the first order necessary condition for with respect to is given by
[TABLE]
which readily leads to an updating rule consisting of solving the linear system
[TABLE]
Let us notice that under the assumption that the diagonal elements of and are strictly positive, and since is positive-semidefinite, is positive-definite, and hence the linear system has a unique solution. The assumption of is adequate, since these elements correspond to the discrete convolution of and . Although the validity of the hypothesis over is not so clear, in practice, the matrix in system (11) has turned out to be non-singular. Nonetheless, can be computed as the best approximate solution in the least-squares sense. Then, solving this linear system entails no challenge, since is usually chosen relatively small, depending on the window step and the reverberation time.
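A per-band linear solve of this shape can be sketched as follows. The system below is derived here under assumptions (squared-error fidelity, a Jensen-type auxiliary function, and a penalty on first-order differences of each row of the reverberation factor), so its entries may not coincide with the paper's matrices. The names `update_H`, `conv_model` and `lam_h` are hypothetical, and the number of taps is assumed not to exceed the number of frames.

```python
import numpy as np

def conv_model(H, S):
    """Band-wise convolution: out[k, n] = sum_m H[k, m] * S[k, n - m]."""
    K, N = S.shape
    out = np.zeros((K, N))
    for m in range(min(H.shape[1], N)):
        out[:, m:] += H[:, m:m + 1] * S[:, :N - m]
    return out

def update_H(X, H, S, lam_h):
    """One step on H with S fixed: solve a small linear system for every band."""
    K, N = S.shape
    M = H.shape[1]
    model = conv_model(H, S)
    D = np.diff(np.eye(M), axis=0)          # (M-1) x M first-order difference operator
    P = D.T @ D                             # positive semidefinite penalty matrix
    H_new = np.empty_like(H)
    for k in range(K):
        a = np.array([S[k, :N - m] @ model[k, m:] for m in range(M)])
        b = np.array([S[k, :N - m] @ X[k, m:] for m in range(M)])
        A = np.diag(a / H[k]) + lam_h[k] * P   # positive diagonal + PSD term
        H_new[k] = np.linalg.lstsq(A, b, rcond=None)[0]  # least-squares solve
    return H_new

rng = np.random.default_rng(0)
K, N, M = 3, 50, 4
S = rng.random((K, N)) + 0.05
X = conv_model(rng.random((K, M)) + 0.1, S)   # synthetic observation
H0 = rng.random((K, M)) + 0.1
lam = np.full(K, 0.5)
H1 = update_H(X, H0, S, lam)
```

Using a least-squares solver rather than a direct inverse keeps the step well defined even if the matrix were singular, in line with the remark above. Each system has only as many unknowns as taps, so the cost per band is small.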
All the steps for the dereverberation process are stated in Algorithm 1. Note that in the initialization we define the clean spectrogram equal to the observation, which is natural since in a way they both correspond to the same signal, and as a vector with exponential time decay, which is an expected characteristic of a RIR. Finally, we set the stopping criterion over the decay of the norm of two consecutive approximations of . This has been shown to work quite well, although other stopping criteria might be considered.
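The initialization and stopping logic described here can be sketched as a small driver loop. Everything named below is an assumption for illustration: `initialise`, `run`, the decay rate, the tolerance, the use of a relative change, and the callable `step`, which stands in for one pass of the two updating rules. Only the iteration cap of 20 comes from the parameter table.

```python
import numpy as np

def initialise(X, M, decay=0.5):
    """S starts at the observation; every row of H is a decaying exponential of length M."""
    S0 = X.copy()
    H0 = np.tile(np.exp(-decay * np.arange(M)), (X.shape[0], 1))
    return H0, S0

def run(X, M, step, max_iter=20, tol=1e-3):
    """Alternate updates via step(X, H, S) -> (H, S) until S stabilises or max_iter is hit."""
    H, S = initialise(X, M)
    it = 0
    for it in range(1, max_iter + 1):
        H_new, S_new = step(X, H, S)
        # Relative change between two consecutive approximations of S.
        change = np.linalg.norm(S_new - S) / max(np.linalg.norm(S), 1e-12)
        H, S = H_new, S_new
        if change < tol:
            break
    return H, S, it

X = np.random.default_rng(0).random((3, 8)) + 0.1
H0, S0 = initialise(X, 4)
_, _, n_fixed = run(X, 4, lambda X, H, S: (H, S))         # nothing changes: stops at once
_, _, n_half = run(X, 4, lambda X, H, S: (H, 0.5 * S))    # never settles: hits the cap
```

Passing the update as a callable keeps the loop independent of the particular rules used for the two factors.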
Results to illustrate the performance of the algorithm are presented in the next section.
4 Experimental results
For the experiments, we took speech signals from the TIMIT database ([28]), recorded at 16 kHz, and artificially made them reverberant by convolution with impulse responses generated with the software Room Impulse Response Generator (https://github.com/ehabets/RIR-Generator), based on the model in [29]. Each signal was degraded under different reverberation conditions: three different room sizes, each with three different microphone positions and four different reverberation times. A web demo for our algorithm can be found at http://fich.unl.edu.ar/sinc/web-demo/blindder/
In order to avoid preprocessing, the choice of the regularization parameters was made a priori by means of empirical rules, based upon signals from a different database. This is supported by the fact that the parameters were observed to be rather robust with respect to variations of the reverberation conditions, and hence they were chosen simply as and . The rest of the model parameters were chosen as specified in Table 1.
Let us point out that the choice of was made so as to allow to capture early reverberation while precluding overlapped representations. In the first place, it is desirable for to represent the RIR along the full Early Decay Time (EDT), the time period in which the reverberation phenomenon alters the clean signal the most, so its effect can be effectively nullified. On the other hand, if we were to choose too large, it might cause certain similarities in the observation within a fixed frequency range to be represented as echoes from high energy components of . It is worth mentioning, however, that the performance of our dereverberation method has shown no high sensitivity with respect to the choice of .
In order to evaluate the performance of our model, we made comparisons against two state of the art methods that work under the same conditions: the one proposed by Kameoka et al. in [16], choosing all the parameters as suggested, and the one proposed by Wisdom et al. in [12], with a window length of .
To measure performance, following [30], we made use of the frequency weighted segmental signal-to-noise ratio (fwsSNR) and cepstral distance. Furthermore, we also measured the speech-to-reverberation modulation energy ratio (SRMR, [31]), which has the advantage of being non-intrusive (it does not use the clean signal as an input). The results for each performance measure are stated in Tables 2-4 and depicted in Figures 4-6, classified according to the reverberation times: 300, 450, 600 and 750 ms. Notice that for the cases of fwsSNR and SRMR, higher values correspond to better performance, while for the cepstral distance, smaller values indicate higher quality.
In regard to the fwsSNR performance measure, the values in Table 2 (Figure 4) show significant improvements of our proposed method with respect to the other two. This improvement becomes more evident as the reverberation time increases. As for the cepstral distance, although the results in Table 3 (Figure 5) show a better performance of our proposed method than of the other two, the quality with respect to the reverberant signal is improved only for reverberation times of 450 ms or more. Finally, the SRMR also shows an improvement with respect to the other methods for reverberation times of 450 ms or greater (see Table 4, Figure 6).
5 Conclusions
In this work, a new blind dereverberation method for speech signals based on regularization over a convolutive NMF representation of the signal spectrograms was introduced and tested. Results show a significant improvement over the state of the art methods, especially for high reverberation times. There is certainly much room for improvement, e.g. finding ways of optimally choosing the regularization parameters, exploring the use of other penalizers, etc.
Acknowledgements
This work was supported in part by Consejo Nacional de Investigaciones Científicas y Técnicas, CONICET through PIP 2014-2016 N 11220130100216-CO, the Air Force Office of Scientific Research, AFOSR/SOARD, through Grant FA9550-14-1-0130, by Universidad Nacional del Litoral, UNL, through CAID-UNL 2011 N 50120110100519 “Procesamiento de Señales Biomédicas.” and CAI+D-UNL 2016, PIC 50420150100036LI “Problemas Inversos y Aplicaciones a Procesamiento de Señales e Imágenes”.
References
- [1] M. Kim, H.-M. Park, Efficient online target speech extraction using DOA-constrained independent component analysis of stereo data for robust speech recognition, Signal Processing 117 (2015) 126–137.
- [2] S. Yun, Y. J. Lee, S. H. Kim, Multilingual speech-to-speech translation system for mobile consumer devices, IEEE Transactions on Consumer Electronics 60 (3) (2014) 508–516.
- [3] R. Neßelrath, M. M. Moniri, M. Feld, Combining speech, gaze, and micro-gestures for the multimodal control of in-car functions, in: Intelligent Environments (IE), 2016 12th International Conference on, IEEE, 2016, pp. 190–193.
- [4] L. Di Persia, D. Milone, H. L. Rufiner, M. Yanagida, Perceptual evaluation of blind source separation for robust speech recognition, Signal Processing 88 (10) (2008) 2578–2583.
- [5] C. E. Martínez, J. Goddard, L. E. Di Persia, D. H. Milone, H. L. Rufiner, Denoising sound signals in a bioinspired non-negative spectro-temporal domain, Digital Signal Processing 38 (2015) 22–31.
- [6] L. Di Persia, D. Milone, M. Yanagida, Indeterminacy free frequency-domain blind separation of reverberant audio sources, IEEE Trans. Audio, Speech and Lang. Proc. 17 (2) (2009) 299–311.
- [7] L. E. Di Persia, D. H. Milone, Using multiple frequency bins for stabilization of FD-ICA algorithms, Signal Processing 119 (2016) 162–168.
- [8] A. Tsilfidis, J. Mourjopoulos, Signal-dependent constraints for perceptually motivated suppression of late reverberation, Signal Processing 90 (3) (2010) 959–965.
