ML Estimation and CRBs for Reverberation, Speech and Noise PSDs in Rank-Deficient Noise-Field
Yaron Laufer, Bracha Laufer-Goldshtein, Sharon Gannot

TL;DR
This paper develops and analyzes maximum likelihood estimators for speech, reverberation, and noise PSDs in reverberant environments with rank-deficient noise, providing CRBs and demonstrating improved estimation accuracy.
Contribution
It introduces two novel closed-form ML estimators for jointly estimating PSDs in rank-deficient noise fields, along with their analytical comparison and CRB derivation.
Findings
Proposed estimators outperform existing methods in simulations.
Derived CRBs provide theoretical performance benchmarks.
Validated estimators on real reverberant and noisy signals.
| Symbol | Meaning |
|---|---|
| $(\cdot)^{\top}$ | transpose |
| $(\cdot)^{\textrm{H}}$ | conjugate transpose |
| $(\cdot)^{*}$ | complex conjugate |
| $\lvert\cdot\rvert$ | determinant of a matrix |
| $\textrm{Tr}[\cdot]$ | trace of a matrix |
| $\otimes$ | Kronecker product |
| $\textrm{vec}(\cdot)$ | stacking the columns of a matrix on top of one another |
| $N$ | No. of microphones |
| $M$ | No. of time frames |
| $K$ | No. of frequency bins |
| $T$ | No. of noise sources |
| $\mathbf{y}$ | Observation signal |
| $\mathbf{x}$ | Direct speech component |
| $\mathbf{g}_d$ | Relative direct-path transfer function |
| $\mathbf{r}$ | Late reverberation component |
| $\mathbf{v}$ | Noise signal |
| $\mathbf{A}_{\mathbf{u}}$ | Noise ATF matrix |
| $\mathbf{s}_{\mathbf{u}}$ | Vector of noise sources |
| $\phi_S$ | Speech PSD |
| $\phi_R$ | Reverberation PSD |
| $\bm{\Gamma}_R$ | Reverberation coherence matrix |
| $\bm{\Psi}_{\mathbf{u}}$ | Noise PSD matrix |
| [dB] | [dB] | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alg.\RSNR | ||||||||||||||
| Blocking LS [15] Dir | 0.56 | 0.70 | 0.77 | 0.76 | 11.85 | 11.02 | 9.80 | 8.67 | -0.55 | -0.52 | -0.45 | -0.37 | ||
| Blocking LS [15] DD | 0.60 | 0.75 | 0.85 | 0.88 | 13.05 | 11.67 | 10.23 | 8.88 | -0.50 | -0.45 | -0.36 | -0.27 | ||
| Blocking ML Dir | 0.74 | 0.88 | 0.93 | 0.90 | 12.56 | 11.77 | 10.61 | 9.39 | -0.74 | -0.65 | -0.52 | -0.40 | ||
| Blocking ML DD | 0.79 | 0.93 | 0.99 | 0.99 | 15.57 | 13.16 | 10.75 | 8.66 | -0.67 | -0.50 | -0.30 | -0.12 | ||
| Non-blocking LS [15] Dir | 0.70 | 0.81 | 0.85 | 0.81 | 13.52 | 11.69 | 9.93 | 8.57 | -0.54 | -0.49 | -0.41 | -0.33 | ||
| Non-blocking LS [15] DD | 0.67 | 0.79 | 0.89 | 0.89 | 13.56 | 11.82 | 10.03 | 8.57 | -0.55 | -0.48 | -0.38 | -0.28 | ||
| Non-blocking ML Dir | 0.98 | 1.12 | 1.15 | 1.12 | 15.59 | 13.69 | 11.67 | 9.77 | -0.74 | -0.61 | -0.46 | -0.32 | ||
| Non-blocking ML DD | 0.79 | 0.93 | 0.99 | 0.99 | 15.57 | 13.16 | 10.75 | 8.66 | -0.67 | -0.50 | -0.30 | -0.12 | ||
| [dB] | [dB] | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alg.\RSNR | ||||||||||||||
| Blocking LS [15] Dir | 0.42 | 0.50 | 0.54 | 0.56 | 10.78 | 10.13 | 9.20 | 8.38 | -0.51 | -0.49 | -0.44 | -0.37 | ||
| Blocking LS [15] DD | 0.46 | 0.54 | 0.62 | 0.67 | 12.49 | 11.54 | 10.46 | 9.47 | -0.45 | -0.41 | -0.36 | -0.30 | ||
| Blocking ML Dir | 0.51 | 0.59 | 0.63 | 0.63 | 10.10 | 9.57 | 8.68 | 7.95 | -0.58 | -0.53 | -0.46 | -0.38 | ||
| Blocking ML DD | 0.61 | 0.71 | 0.78 | 0.80 | 14.93 | 13.10 | 11.30 | 9.67 | -0.57 | -0.44 | -0.28 | -0.13 | ||
| Non-blocking LS [15] Dir | 0.56 | 0.60 | 0.63 | 0.62 | 13.22 | 11.49 | 9.93 | 8.70 | -0.51 | -0.48 | -0.42 | -0.36 | ||
| Non-blocking LS [15] DD | 0.51 | 0.59 | 0.67 | 0.71 | 13.37 | 12.07 | 10.66 | 8.73 | -0.51 | -0.47 | -0.40 | -0.32 | ||
| Non-blocking ML Dir | 0.69 | 0.78 | 0.80 | 0.81 | 13.88 | 12.39 | 10.81 | 9.47 | -0.62 | -0.52 | -0.41 | -0.30 | ||
| Non-blocking ML DD | 0.61 | 0.71 | 0.78 | 0.80 | 14.93 | 13.10 | 11.30 | 9.67 | -0.57 | -0.44 | -0.28 | -0.13 | ||
Yaron Laufer, Bracha Laufer-Goldshtein, and Sharon Gannot are with the Faculty of Engineering, Bar-Ilan University, Ramat-Gan, 5290002, Israel.
Abstract
Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is less than the number of microphones. We derive two closed-form maximum likelihood estimators. The first is a non-blocking-based estimator, which jointly estimates the speech, reverberation and noise PSDs, and the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error expressions are derived. Furthermore, Cramér-Rao Bounds on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.
Index Terms:
Dereverberation, Noise reduction, Maximum likelihood estimation, Cramér-Rao Bound.
I Introduction
In many hands-free scenarios, the measured microphone signals are corrupted by additive background noise, which may originate both from environmental sources and from microphone responses. Apart from noise, if the recording takes place in an enclosed space, the recorded signals may also contain multiple sound reflections from walls and other objects in the room, resulting in reverberation. As the level of noise and reverberation increases, the perceived quality and intelligibility of the speech signal deteriorate, which in turn affects the performance of speech communication systems, as well as automatic speech recognition (ASR) systems.
In order to reduce the effects of reverberation and noise, speech enhancement algorithms are required, which aim at recovering the clean speech source from the recorded microphone signals. Speech dereverberation and noise reduction algorithms often require the power spectral densities (PSDs) of the speech, reverberation and noise components. In the multichannel framework, a commonly used assumption (see e.g. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) is that the late reverberant signal is modelled as a spatially homogeneous sound field, with a time-varying PSD multiplied by a spherically diffuse, time-invariant spatial coherence matrix. As the spatial coherence matrix depends only on the microphone geometry, it can be calculated in advance. However, the reverberation PSD is an unknown parameter that should be estimated. Numerous methods exist for estimating the reverberation PSD. They are broadly divided into two classes, namely non-blocking-based estimators and blocking-based estimators. The non-blocking-based approach jointly estimates the PSDs of the late reverberation and speech. The estimation is carried out using the maximum likelihood (ML) criterion [3, 6] or in the least-squares (LS) sense, by minimizing the Frobenius norm of an error PSD matrix [8]. In the blocking-based method, the desired speech signal is first blocked using a blocking matrix (BM), and then the reverberation PSD is estimated. Estimators in this class are also based on the ML approach [1, 5, 7, 9] or the LS criterion [2, 4].
None of the previously mentioned methods includes an estimator for the noise PSD. In [1, 3], a noiseless scenario is assumed. In [2, 4, 5, 6, 8, 7, 9, 10], the noise PSD matrix is assumed to be known in advance, or an estimate of it is assumed to be available. Typically, the noise PSD matrix is assumed to be time-invariant, and therefore can be estimated during speech-absent periods using a voice activity detector (VAD). However, in practical acoustic scenarios the spectral characteristics of the noise might be time-varying, e.g. when the noise environment includes a background radio or TV, and thus a VAD-based algorithm may fail. Therefore, the noise PSD matrix has to be included in the estimation procedure.
Some papers in the field deal with performance analysis of the proposed estimators. We give a brief review of the commonly used tools to assess the quality of an estimator. Theoretical analysis of estimators typically consists of calculating the bias and the mean square error (MSE), which coincides with the variance for unbiased estimators. The Cramér-Rao Bound (CRB) is an important tool to evaluate the quality of any unbiased estimator, since it gives a lower bound on the MSE. An estimator that is unbiased and attains the CRB is called efficient. The maximum likelihood estimator (MLE) is asymptotically efficient [11], namely it attains the CRB when the number of samples is large.
Theoretical analysis of PSD estimators in the noise-free scenario was addressed in [12, 13]. In [12], CRBs were derived for the reverberation and the speech MLEs proposed in [3]. These MLEs are efficient, i.e. attain the CRB for any number of samples. In addition, it was pointed out that the non-blocking-based reverberation MLE derived in [3] is identical to the blocking-based MLE proposed in [1]. In [13], it was shown that the non-blocking-based reverberation MLE of [3] obtains a lower MSE compared to a noiseless version of the blocking-based LS estimator derived in [2].
In the noisy case, quality assessment was discussed in [9] and [14]. In [9], it was numerically demonstrated that an iterative blocking-based MLE yields lower MSE than the blocking-based LS estimator proposed in [2]. In [14], closed-form CRBs were derived for the two previously proposed MLEs of the reverberation PSD, namely the blocking-based estimator in [5] and the non-blocking-based estimator in [6]. The CRB for the non-blocking-based reverberation estimator was shown to be lower than the CRB for the blocking-based estimator. However, it was shown that in the noiseless case, both reverberation MLEs are identical and both CRBs coincide.
As opposed to previous works, the assumption of known noise PSD matrix is not made in [15]. The noise is modelled as a spatially homogeneous sound field, with a time-varying PSD multiplied by a time-invariant spatial coherence matrix. It is assumed that the spatial coherence matrix of the noise is known in advance, while the time-varying PSD is unknown. Two different estimators were developed, based on the LS method. In the first one, a joint estimator for the speech, noise and late reverberation PSDs was developed. As an alternative, a blocking-based estimator was proposed, in which the speech signal is first blocked by a BM, and then the noise and reverberation PSDs are jointly estimated. This method was further extended in [16] to jointly estimate the various PSDs along with the relative early transfer function (RETF) vector, using an alternating least-squares method. However, the model used in [15] and [16] only fits spatially homogeneous noise fields that are characterized by a full-rank covariance matrix. Moreover, in [9] it was claimed that the ML approach is preferable over the LS estimation procedure.
Recently, confirmatory factor analysis was used in [17] to jointly estimate the PSDs and the RETF. The noise is modelled as microphone self-noise, with a time-invariant diagonal PSD matrix. A closed-form solution is not available for this case, thus requiring iterative optimization techniques.
In this paper, we treat the noise PSD matrix as an unknown parameter. We assume that the noise PSD matrix is a rank-deficient matrix, as opposed to the spatially homogeneous assumption considered in [15]. This scenario arises when the noise signal consists of a set of directional noise sources, whose number is smaller than the number of microphones. We assume that the positions of the noise sources are fixed, while their PSD matrix is time-varying, e.g. when the acoustic environment includes a radio or TV. It should be emphasized that, in contrast to [15], which estimates only a scalar PSD of the noise, in our model the entire PSD matrix of the noise is estimated, and thus the case of multiple non-stationary noise sources can be handled. We derive closed-form MLEs of the various PSDs, for both the non-blocking-based and the blocking-based methods. The proposed estimators are analytically studied and compared, and the corresponding MSE expressions are derived. Furthermore, CRBs for estimating the various PSDs are derived.
An important benefit of treating rank-deficient noise as a separate problem lies in the form of the solution. In the ML framework, a closed-form solution exists for the noiseless case [1, 3] but not for the full-rank noise scenario, which therefore requires iterative optimization techniques [5, 6, 9] (as opposed to the LS method, which has closed-form solutions in both cases). However, we show here that when the noise PSD matrix is rank-deficient, a closed-form MLE exists, which yields a simpler and faster estimation procedure with low computational complexity that is not sensitive to local maxima.
The remainder of the paper is organized as follows. Section II presents the problem formulation, and describes the probabilistic model. Section III derives the MLEs for both the non-blocking-based and the blocking-based methods, and Section IV presents the CRB derivation. Section V demonstrates the performance of the proposed estimators by an experimental study based on both simulated data and recorded room impulse responses. The paper is concluded in Section VI.
II Problem Formulation
In this section, we formulate the dereverberation and noise reduction problem. Scalars are denoted with regular lowercase letters, vectors are denoted with bold lowercase letters and matrices are denoted with bold uppercase letters. A list of notations used in our derivations is given in Table I.
II-A Signal Model
Consider a speech signal received by $N$ microphones, in a noisy and reverberant acoustic environment. We work with the short-time Fourier transform (STFT) representation of the measured signals. Let $k\in[1,K]$ denote the frequency bin index, and $m\in[1,M]$ denote the time frame index. The $N$-channel observation signal $\mathbf{y}(m,k)=[Y_1(m,k),\ldots,Y_N(m,k)]^{\top}$ writes

$$\mathbf{y}(m,k)=\mathbf{x}(m,k)+\mathbf{r}(m,k)+\mathbf{v}(m,k), \tag{1}$$

where $\mathbf{x}(m,k)$ denotes the direct and early reverberation speech component, $\mathbf{r}(m,k)$ denotes the late reverberation speech component and $\mathbf{v}(m,k)$ denotes the noise. The direct and early reverberation speech component is given by $\mathbf{x}(m,k)=\mathbf{g}_d(k)\,s(m,k)$, where $s(m,k)$ is the direct and early speech component as received by the first microphone (designated as a reference microphone), and $\mathbf{g}_d(k)$ is the time-invariant relative early transfer function (RETF) vector between the reference microphone and all microphones. In this paper, we follow previous works in the field, e.g. [2, 13, 14, 15], and neglect the early reflections. Thus, the target signal $s(m,k)$ is approximated as the direct component at the reference microphone, and $\mathbf{g}_d(k)$ reduces to the relative direct-path transfer function (RDTF) vector. It is assumed that the noise signal consists of $T$ noise sources, i.e.

$$\mathbf{v}(m,k)=\mathbf{A}_{\mathbf{u}}(k)\,\mathbf{s}_{\mathbf{u}}(m,k), \tag{2}$$

where $\mathbf{s}_{\mathbf{u}}(m,k)\in\mathbb{C}^{T}$ denotes the vector of noise sources and $\mathbf{A}_{\mathbf{u}}(k)\in\mathbb{C}^{N\times T}$ is the noise acoustic transfer function (ATF) matrix, assumed to be time-invariant. It is further assumed that $T\leq N-2$.
II-B Probabilistic Model
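The speech STFT coefficients are assumed to follow a zero-mean complex Gaussian distribution with a time-varying PSD $\phi_S(m,k)$. (The multivariate complex Gaussian probability density function (PDF) is given by $\mathcal{N}_{c}(\mathbf{a};\bm{\mu}_{a},\bm{\Phi}_{a})=\frac{1}{|\pi\bm{\Phi}_{a}|}\exp\big(-(\mathbf{a}-\bm{\mu}_{a})^{\textrm{H}}\bm{\Phi}_{a}^{-1}(\mathbf{a}-\bm{\mu}_{a})\big)$, where $\bm{\mu}_{a}$ is the mean vector and $\bm{\Phi}_{a}$ is a Hermitian positive definite complex covariance matrix [18].) Hence, the PDF of the speech writes:

$$p\big(s(m,k);\phi_S(m,k)\big)=\mathcal{N}_{c}\big(s(m,k);0,\phi_S(m,k)\big). \tag{3}$$

The late reverberation signal is modelled by a zero-mean complex multivariate Gaussian distribution:

$$p\big(\mathbf{r}(m,k);\bm{\Phi}_{\mathbf{r}}(m,k)\big)=\mathcal{N}_{c}\big(\mathbf{r}(m,k);\bm{0},\bm{\Phi}_{\mathbf{r}}(m,k)\big). \tag{4}$$

The reverberation PSD matrix is modelled as a spatially homogeneous and isotropic sound field, with a time-varying PSD, $\bm{\Phi}_{\mathbf{r}}(m,k)=\phi_R(m,k)\,\bm{\Gamma}_R(k)$. It is assumed that the time-invariant coherence matrix $\bm{\Gamma}_R(k)$ can be modelled by a spherically diffuse sound field [19]:

$$\Gamma_{R,ij}(k)=\textrm{sinc}\left(\frac{2\pi f_s k}{K}\,\frac{d_{ij}}{c}\right), \tag{5}$$

where $\textrm{sinc}(x)=\sin(x)/x$, $d_{ij}$ is the inter-distance between microphones $i$ and $j$, $f_s$ denotes the sampling frequency and $c$ is the sound velocity.
The noise sources vector is modelled by a zero-mean complex multivariate Gaussian distribution with a time-varying PSD matrix $\bm{\Psi}_{\mathbf{u}}(m,k)\in\mathbb{C}^{T\times T}$:

$$p\big(\mathbf{s}_{\mathbf{u}}(m,k);\bm{\Psi}_{\mathbf{u}}(m,k)\big)=\mathcal{N}_{c}\big(\mathbf{s}_{\mathbf{u}}(m,k);\bm{0},\bm{\Psi}_{\mathbf{u}}(m,k)\big). \tag{6}$$

Using (2) and (6), it follows that $\mathbf{v}(m,k)$ has a zero-mean complex multivariate Gaussian distribution with a PSD matrix $\bm{\Phi}_{\mathbf{v}}(m,k)\in\mathbb{C}^{N\times N}$, given by

$$\bm{\Phi}_{\mathbf{v}}(m,k)=\mathbf{A}_{\mathbf{u}}(k)\,\bm{\Psi}_{\mathbf{u}}(m,k)\,\mathbf{A}_{\mathbf{u}}^{\textrm{H}}(k). \tag{7}$$

Note that $\textrm{rank}(\bm{\Phi}_{\mathbf{v}})=T\leq N-2$, i.e. $\bm{\Phi}_{\mathbf{v}}$ is a rank-deficient matrix. The PDF of $\mathbf{y}(m,k)$ writes

$$p\big(\mathbf{y}(m,k);\bm{\Phi}_{\mathbf{y}}(m,k)\big)=\mathcal{N}_{c}\big(\mathbf{y}(m,k);\bm{0},\bm{\Phi}_{\mathbf{y}}(m,k)\big), \tag{8}$$

where $\bm{\Phi}_{\mathbf{y}}$ is the PSD matrix of the input signals. Assuming that the components in (1) are independent, $\bm{\Phi}_{\mathbf{y}}$ is given by

$$\bm{\Phi}_{\mathbf{y}}(m,k)=\phi_S(m,k)\,\mathbf{g}_d(k)\mathbf{g}_d^{\textrm{H}}(k)+\phi_R(m,k)\,\bm{\Gamma}_R(k)+\mathbf{A}_{\mathbf{u}}(k)\,\bm{\Psi}_{\mathbf{u}}(m,k)\,\mathbf{A}_{\mathbf{u}}^{\textrm{H}}(k). \tag{9}$$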
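The covariance model above can be illustrated with a short NumPy sketch. It assumes a uniform linear array, a spherically diffuse coherence of the form $\sin(a)/a$ with $a=2\pi f d_{ij}/c$, and an input PSD matrix built as speech plus reverberation plus rank-$T$ directional noise. The array spacing, frequency, placeholder RDTF and all function names are hypothetical choices made for illustration only.

```python
import numpy as np

def diffuse_coherence(n_mics, spacing_m, freq_hz, c=343.0):
    """Spherically diffuse coherence matrix of a uniform linear array."""
    idx = np.arange(n_mics)
    dist = spacing_m * np.abs(idx[:, None] - idx[None, :])   # d_ij in metres
    # np.sinc(x) is sin(pi*x)/(pi*x), so sin(a)/a with a = 2*pi*f*d/c
    # equals np.sinc(2*f*d/c).
    return np.sinc(2.0 * freq_hz * dist / c)

def observation_psd(phi_s, phi_r, g_d, gamma, A_u, psi_u):
    """Return (Phi_y, Phi_v) with Phi_y = phi_S g g^H + phi_R Gamma + A Psi A^H."""
    g = g_d.reshape(-1, 1)
    phi_v = A_u @ psi_u @ A_u.conj().T                       # rank-T noise PSD matrix
    return phi_s * (g @ g.conj().T) + phi_r * gamma + phi_v, phi_v

rng = np.random.default_rng(0)
N, T = 6, 2                                                  # hypothetical sizes, T <= N - 2
gamma = diffuse_coherence(N, spacing_m=0.06, freq_hz=2500.0)
g_d = np.exp(-1j * 0.8 * np.arange(N))                       # placeholder RDTF
A_u = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))
C = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
psi_u = C @ C.conj().T                                       # Hermitian positive definite
phi_y, phi_v = observation_psd(1.0, 0.5, g_d, gamma, A_u, psi_u)
print(np.linalg.matrix_rank(phi_v), np.linalg.matrix_rank(phi_y))
```

The noise term alone has rank $T$, while the diffuse reverberation term makes the full input PSD matrix invertible, which is what the estimators rely on.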
A commonly used dereverberation and noise reduction technique is to estimate the speech signal using the multichannel minimum mean square error (MMSE) estimator, which yields the multichannel Wiener filter (MCWF), given by [20]:

$$\hat{s}_{\textrm{MCWF}}(m,k)=\frac{\mathbf{g}_d^{\textrm{H}}\bm{\Phi}_{\mathbf{i}}^{-1}}{\mathbf{g}_d^{\textrm{H}}\bm{\Phi}_{\mathbf{i}}^{-1}\mathbf{g}_d+\phi_S^{-1}}\,\mathbf{y}(m,k), \tag{10}$$

where

$$\bm{\Phi}_{\mathbf{i}}=\phi_R\,\bm{\Gamma}_R+\mathbf{A}_{\mathbf{u}}\bm{\Psi}_{\mathbf{u}}\mathbf{A}_{\mathbf{u}}^{\textrm{H}} \tag{11}$$

denotes the total interference PSD matrix. For implementing (10), we assume that the RDTF vector $\mathbf{g}_d$ and the spatial coherence matrix $\bm{\Gamma}_R$ are known in advance. The RDTF depends only on the direction of arrival (DOA) of the speaker and the geometry of the microphone array, and thus it can be constructed based on a DOA estimate. The spatial coherence matrix is calculated using (5), based on the spherical diffuseness assumption. It follows that estimators of the late reverberation, speech and noise PSDs are required for evaluating the MCWF.
It should be noted that directly estimating the complete time-varying noise PSD matrix $\bm{\Phi}_{\mathbf{v}}$, along with the speech and reverberation PSDs, is a complex problem, and a closed-form ML solution is not available. In this paper, we rely on the decomposition of the noise PSD matrix in (7), along with the rank-deficiency assumption. By utilizing the time-invariant ATF matrix $\mathbf{A}_{\mathbf{u}}$, we can construct a projection matrix onto the subspace orthogonal to the noise subspace, which generates nulls towards the noise sources. In the following section, we show that this step enables the derivation of closed-form MLEs for the various PSDs.
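As a rough illustration of how the PSD estimates feed the MCWF, the sketch below computes the filter weights in the MVDR-plus-postfilter form $\mathbf{w}^{\textrm{H}}=\mathbf{g}_d^{\textrm{H}}\bm{\Phi}_{\mathbf{i}}^{-1}/(\mathbf{g}_d^{\textrm{H}}\bm{\Phi}_{\mathbf{i}}^{-1}\mathbf{g}_d+\phi_S^{-1})$, assuming the interference PSD matrix is invertible. The function names are hypothetical and the inputs would in practice be the estimated quantities.

```python
import numpy as np

def mcwf_weights(g_d, phi_s, phi_i):
    """MCWF weight vector w such that the speech estimate is w^H y.

    g_d   : (N,) RDTF vector
    phi_s : speech PSD (positive scalar)
    phi_i : (N, N) total interference PSD matrix (reverberation plus noise)
    """
    pi_g = np.linalg.solve(phi_i, g_d)                    # Phi_i^{-1} g_d
    denom = np.real(np.vdot(g_d, pi_g)) + 1.0 / phi_s     # g^H Phi_i^{-1} g + 1/phi_S
    return pi_g / denom

def mcwf_estimate(w, y):
    """Apply the filter to one STFT observation vector y."""
    return np.vdot(w, y)                                  # np.vdot conjugates w: w^H y
```

By the matrix inversion lemma this form coincides with the textbook Wiener solution $\phi_S\bm{\Phi}_{\mathbf{y}}^{-1}\mathbf{g}_d$ when $\bm{\Phi}_{\mathbf{y}}=\phi_S\mathbf{g}_d\mathbf{g}_d^{\textrm{H}}+\bm{\Phi}_{\mathbf{i}}$, which makes the dependence on the three PSDs explicit.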
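The noise ATF matrix $\mathbf{A}_{\mathbf{u}}$ is in general not available, since the estimation of the individual ATFs requires that each noise source be active separately. Instead, we only assume the availability of a speech-absent segment, denoted by $m_0$, in which all noise sources are concurrently active. Based on this segment, we compute a basis for the noise subspace that can be used instead of the unknown ATF matrix. To this end, we compute the noise PSD matrix $\bm{\Phi}_{\mathbf{v}}(m_0,k)$, and apply the eigenvalue decomposition (EVD) to the resulting matrix. Recall that the rank of $\bm{\Phi}_{\mathbf{v}}$ is $T$. Accordingly, a $T$-rank representation of the noise PSD matrix is formed by the computed eigenvalues and eigenvectors:

$$\bm{\Phi}_{\mathbf{v}}(m_0,k)=\mathbf{V}(k)\,\bm{\Lambda}(m_0,k)\,\mathbf{V}^{\textrm{H}}(k), \tag{12}$$

where $\bm{\Lambda}\in\mathbb{C}^{T\times T}$ is the eigenvalues matrix (comprised of the non-zero eigenvalues) and $\mathbf{V}\in\mathbb{C}^{N\times T}$ is the corresponding eigenvectors matrix. $\mathbf{V}$ is a basis that spans the noise ATFs subspace, and thus [21]

$$\mathbf{A}_{\mathbf{u}}(k)=\mathbf{V}(k)\,\mathbf{G}(k), \tag{13}$$

where $\mathbf{G}\in\mathbb{C}^{T\times T}$ consists of projection coefficients of the original ATFs on the basis vectors. Substituting (13) into (2) and then into (1), yields

$$\mathbf{y}(m,k)=\mathbf{g}_d(k)\,s(m,k)+\mathbf{r}(m,k)+\mathbf{V}(k)\,\mathbf{s}_{\mathbf{w}}(m,k), \tag{14}$$

where $\mathbf{s}_{\mathbf{w}}(m,k)=\mathbf{G}(k)\,\mathbf{s}_{\mathbf{u}}(m,k)$. It follows that the noise PSD matrix in (7) can be recast as

$$\bm{\Phi}_{\mathbf{v}}(m,k)=\mathbf{V}(k)\,\bm{\Psi}_{\mathbf{w}}(m,k)\,\mathbf{V}^{\textrm{H}}(k), \tag{15}$$

where $\bm{\Psi}_{\mathbf{w}}=\mathbf{G}\bm{\Psi}_{\mathbf{u}}\mathbf{G}^{\textrm{H}}$. Using this basis change, the MCWF in (10) is now computed with

$$\bm{\Phi}_{\mathbf{i}}=\phi_R\,\bm{\Gamma}_R+\mathbf{V}\bm{\Psi}_{\mathbf{w}}\mathbf{V}^{\textrm{H}}. \tag{16}$$

As a result, rather than requiring the knowledge of the exact noise ATF matrix, we use $\mathbf{V}$, which is learned from a speech-absent segment. Due to this basis change, we will need to estimate $\bm{\Psi}_{\mathbf{w}}$ instead of $\bm{\Psi}_{\mathbf{u}}$.
To summarize, estimators of $\phi_R$, $\phi_S$ and $\bm{\Psi}_{\mathbf{w}}$ are required. For the sake of brevity, the frame index $m$ and the frequency bin index $k$ are henceforth omitted whenever possible.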
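The basis-learning step can be sketched as follows, assuming the speech-absent segment contains only the directional noise (no residual reverberation or sensor noise), so that its sample covariance has rank $T$. The function name `noise_basis` and the synthetic data are hypothetical.

```python
import numpy as np

def noise_basis(noise_frames, n_sources):
    """Estimate an orthonormal basis of the noise subspace in one frequency bin.

    noise_frames : (N, L0) complex STFT snapshots from a speech-absent segment
    n_sources    : assumed number of directional noise sources T
    """
    n_frames = noise_frames.shape[1]
    R_v = noise_frames @ noise_frames.conj().T / n_frames   # sample noise PSD matrix
    eigvals, eigvecs = np.linalg.eigh(R_v)                  # eigenvalues in ascending order
    return eigvecs[:, -n_sources:], eigvals[-n_sources:]    # T principal components

# Synthetic check: the learned basis spans the columns of the (unknown) ATF matrix.
rng = np.random.default_rng(3)
N, T, L0 = 6, 2, 200
A_u = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))
s_u = rng.standard_normal((T, L0)) + 1j * rng.standard_normal((T, L0))
V, lam = noise_basis(A_u @ s_u, T)
G = V.conj().T @ A_u            # projection coefficients, so that A_u = V G
```

In practice the segment would also contain weaker components, and the choice of $T$ (for example from the eigenvalue profile) becomes part of the design.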
III ML Estimators
We propose two ML-based methods: (i) Non-blocking-based estimation: Simultaneous ML estimation of the reverberation, speech and noise PSDs; and (ii) Blocking-based estimation: Elimination of the speech PSD using a blocking matrix (BM), and then joint ML estimation of the reverberation and noise PSDs. Both methods are then compared and analyzed.
III-A Non-Blocking-Based Estimation
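We start with the joint ML estimation of the reverberation, speech and noise PSDs. Based on the short-time stationarity assumption [12, 9], it is assumed that the PSDs are approximately constant across a small number of consecutive time frames, denoted by $L$. We therefore denote by $\bar{\mathbf{y}}(m)$ the concatenation of $L$ successive observations of $\mathbf{y}(m)$:

$$\bar{\mathbf{y}}(m)=\big[\mathbf{y}^{\top}(m-L+1),\ldots,\mathbf{y}^{\top}(m)\big]^{\top}. \tag{17}$$

The set of unknown parameters is denoted by $\bm{\phi}(m)=\big[\phi_R(m),\phi_S(m),\bm{\psi}_{\mathbf{w}}^{\top}(m)\big]^{\top}$, where $\bm{\psi}_{\mathbf{w}}$ is a vector containing all elements of the $T\times T$ noise PSD matrix $\bm{\Psi}_{\mathbf{w}}$, i.e. $\bm{\psi}_{\mathbf{w}}=\textrm{vec}(\bm{\Psi}_{\mathbf{w}})$. Assuming that the $L$ consecutive signals in $\bar{\mathbf{y}}(m)$ are i.i.d., the PDF of $\bar{\mathbf{y}}(m)$ writes (see e.g. [14]):

$$p\big(\bar{\mathbf{y}}(m);\bm{\phi}(m)\big)=\left(\frac{1}{\pi^{N}|\bm{\Phi}_{\mathbf{y}}(m)|}\exp\Big(-\textrm{Tr}\big[\bm{\Phi}_{\mathbf{y}}^{-1}(m)\,\mathbf{R}_{\mathbf{y}}(m)\big]\Big)\right)^{L}, \tag{18}$$

where $\mathbf{R}_{\mathbf{y}}(m)$ is the sample covariance matrix, given by

$$\mathbf{R}_{\mathbf{y}}(m)=\frac{1}{L}\sum_{\ell=m-L+1}^{m}\mathbf{y}(\ell)\,\mathbf{y}^{\textrm{H}}(\ell). \tag{19}$$

The MLE of the set $\bm{\phi}(m)$ is therefore given by

$$\hat{\bm{\phi}}^{\textrm{ML},\mathbf{y}}(m)=\underset{\bm{\phi}(m)}{\arg\max}\;\log p\big(\bar{\mathbf{y}}(m);\bm{\phi}(m)\big). \tag{20}$$

To the best of our knowledge, for the general noisy scenario this problem is considered as having no closed-form solution. However, we will show that when the noise PSD matrix is rank-deficient, with $T\leq N-2$, a closed-form solution exists. In the following, we present the proposed estimators. The detailed derivations appear in the Appendices.
In Appendix A, it is shown that the MLE of $\phi_R$ is given by:

$$\hat{\phi}_R^{\textrm{ML},\mathbf{y}}=\frac{1}{N-(T+1)}\textrm{Tr}\big[\mathbf{Q}\,\mathbf{R}_{\mathbf{y}}\,\bm{\Gamma}_R^{-1}\big], \tag{21}$$

where $\mathbf{Q}\in\mathbb{C}^{N\times N}$ is given by

$$\mathbf{Q}=\mathbf{I}_N-\mathbf{A}\big(\mathbf{A}^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{A}\big)^{-1}\mathbf{A}^{\textrm{H}}\bm{\Gamma}_R^{-1}, \tag{22}$$

and $\mathbf{A}\in\mathbb{C}^{N\times(T+1)}$ is the speech-plus-noise subspace

$$\mathbf{A}=\big[\mathbf{g}_d,\;\mathbf{V}\big], \tag{23}$$

with $\mathbf{V}$ the noise subspace basis.
The matrix $\mathbf{Q}$ is a projection matrix onto the subspace orthogonal to the speech-plus-noise subspace. The role of $\mathbf{Q}$ is to block the directions of the desired speech and noise signals, in order to estimate the reverberation level.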
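A minimal NumPy sketch of the closed-form reverberation estimator follows. It assumes the estimator takes the form $\hat{\phi}_R=\textrm{Tr}[\mathbf{Q}\mathbf{R}_{\mathbf{y}}\bm{\Gamma}_R^{-1}]/(N-T-1)$ with $\mathbf{Q}=\mathbf{I}-\mathbf{A}(\mathbf{A}^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{A})^{-1}\mathbf{A}^{\textrm{H}}\bm{\Gamma}_R^{-1}$ and $\mathbf{A}=[\mathbf{g}_d,\mathbf{V}]$. The function name, the stand-in coherence matrix and the test values are hypothetical.

```python
import numpy as np

def reverb_psd_mle(R_y, gamma, g_d, V):
    """Closed-form reverberation PSD estimate Tr[Q R_y Gamma^{-1}] / (N - T - 1)."""
    N, T = V.shape
    A = np.column_stack([g_d, V])                 # speech-plus-noise subspace, N x (T+1)
    Gi_A = np.linalg.solve(gamma, A)              # Gamma^{-1} A
    # Q = I - A (A^H Gamma^{-1} A)^{-1} A^H Gamma^{-1}; Gamma is Hermitian,
    # so A^H Gamma^{-1} is the conjugate transpose of Gamma^{-1} A.
    Q = np.eye(N) - A @ np.linalg.solve(A.conj().T @ Gi_A, Gi_A.conj().T)
    phi_r = np.real(np.trace(Q @ R_y @ np.linalg.inv(gamma))) / (N - T - 1)
    return phi_r, Q

# Sanity check with the exact covariance in place of the sample covariance.
rng = np.random.default_rng(0)
N, T = 6, 2
idx = np.arange(N)
gamma = np.sinc(0.8 * np.abs(idx[:, None] - idx[None, :]))   # diffuse-field stand-in
g_d = np.exp(-1j * 0.8 * idx)                                # placeholder RDTF
V = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))[0]
psi = np.diag([2.0, 0.5]).astype(complex)
phi_y = 1.5 * np.outer(g_d, g_d.conj()) + 0.7 * gamma + V @ psi @ V.conj().T
phi_r_hat, Q = reverb_psd_mle(phi_y, gamma, g_d, V)
```

Because $\mathbf{Q}\mathbf{A}=\mathbf{0}$ and $\textrm{Tr}[\mathbf{Q}]=N-T-1$, feeding the exact covariance returns the true reverberation PSD, which is the same argument that underlies the unbiasedness of the estimator.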
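Once we obtain the MLE for the late reverberation PSD, the MLEs for the speech and noise PSDs can be computed. In Appendix B, it is shown that the MLE for the speech PSD writes

$$\hat{\phi}_S^{\textrm{ML},\mathbf{y}}=\mathbf{w}_s^{\textrm{H}}\big(\mathbf{R}_{\mathbf{y}}-\hat{\phi}_R^{\textrm{ML},\mathbf{y}}\,\bm{\Gamma}_R\big)\mathbf{w}_s, \tag{24}$$

where $\mathbf{w}_s\in\mathbb{C}^{N}$ is a minimum variance distortionless response (MVDR) beamformer that extracts the speech signal while eliminating the noise, given by

$$\mathbf{w}_s^{\textrm{H}}=\frac{\mathbf{g}_d^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{P}_{\mathbf{v}}^{\perp}}{\mathbf{g}_d^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{P}_{\mathbf{v}}^{\perp}\mathbf{g}_d}, \tag{25}$$

and $\mathbf{P}_{\mathbf{v}}^{\perp}\in\mathbb{C}^{N\times N}$ is a projection matrix onto the subspace orthogonal to the noise subspace, given by

$$\mathbf{P}_{\mathbf{v}}^{\perp}=\mathbf{I}_N-\mathbf{V}\big(\mathbf{V}^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{V}\big)^{-1}\mathbf{V}^{\textrm{H}}\bm{\Gamma}_R^{-1}. \tag{26}$$

Note that

$$\mathbf{w}_s^{\textrm{H}}\mathbf{g}_d=1,\qquad \mathbf{w}_s^{\textrm{H}}\mathbf{V}=\bm{0}^{\top}. \tag{27}$$

The estimator in (24) can be interpreted as the variance of the noisy observations minus the estimated variance of the reverberation, at the output of the MVDR beamformer [12].
In Appendix C, it is shown that the MLE of the noise PSD can be computed with

$$\hat{\bm{\Psi}}_{\mathbf{w}}^{\textrm{ML},\mathbf{y}}=\mathbf{W}_{\mathbf{u}}^{\textrm{H}}\big(\mathbf{R}_{\mathbf{y}}-\hat{\phi}_R^{\textrm{ML},\mathbf{y}}\,\bm{\Gamma}_R\big)\mathbf{W}_{\mathbf{u}}, \tag{28}$$

where $\mathbf{W}_{\mathbf{u}}\in\mathbb{C}^{N\times T}$ is a multi-source linearly constrained minimum variance (LCMV) beamformer that extracts the noise signals while eliminating the speech signal:

$$\mathbf{W}_{\mathbf{u}}^{\textrm{H}}=\big(\mathbf{V}^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{P}_{\mathbf{g}}^{\perp}\mathbf{V}\big)^{-1}\mathbf{V}^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{P}_{\mathbf{g}}^{\perp}, \tag{29}$$

and $\mathbf{P}_{\mathbf{g}}^{\perp}\in\mathbb{C}^{N\times N}$ is a projection matrix onto the subspace orthogonal to the speech subspace, given by

$$\mathbf{P}_{\mathbf{g}}^{\perp}=\mathbf{I}_N-\frac{\mathbf{g}_d\mathbf{g}_d^{\textrm{H}}\bm{\Gamma}_R^{-1}}{\mathbf{g}_d^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{g}_d}. \tag{30}$$

Note that

$$\mathbf{W}_{\mathbf{u}}^{\textrm{H}}\mathbf{g}_d=\bm{0},\qquad \mathbf{W}_{\mathbf{u}}^{\textrm{H}}\mathbf{V}=\mathbf{I}_T. \tag{31}$$

Interestingly, the projection matrix $\mathbf{Q}$ can be recast as a linear combination of the above beamformers (see Appendix D):

$$\mathbf{Q}=\mathbf{I}_N-\mathbf{g}_d\mathbf{w}_s^{\textrm{H}}-\mathbf{V}\mathbf{W}_{\mathbf{u}}^{\textrm{H}}. \tag{32}$$

Using (27), (31) and (32), it can also be noted that $\mathbf{Q}$ is orthogonal to both beamformers

$$\mathbf{w}_s^{\textrm{H}}\mathbf{Q}=\bm{0}^{\top},\qquad \mathbf{W}_{\mathbf{u}}^{\textrm{H}}\mathbf{Q}=\bm{0}. \tag{33}$$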
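The speech and noise PSD estimators can be sketched in NumPy as below. The sketch assumes the MVDR and LCMV beamformers are built from oblique projections in the $\bm{\Gamma}_R^{-1}$ inner product, i.e. $\mathbf{P}_{\mathbf{v}}^{\perp}=\mathbf{I}-\mathbf{V}(\mathbf{V}^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{V})^{-1}\mathbf{V}^{\textrm{H}}\bm{\Gamma}_R^{-1}$ and $\mathbf{P}_{\mathbf{g}}^{\perp}=\mathbf{I}-\mathbf{g}_d\mathbf{g}_d^{\textrm{H}}\bm{\Gamma}_R^{-1}/(\mathbf{g}_d^{\textrm{H}}\bm{\Gamma}_R^{-1}\mathbf{g}_d)$, and that the projection used for the reverberation PSD equals $\mathbf{I}-\mathbf{g}_d\mathbf{w}_s^{\textrm{H}}-\mathbf{V}\mathbf{W}_{\mathbf{u}}^{\textrm{H}}$. All names are hypothetical.

```python
import numpy as np

def speech_noise_psd_mle(R_y, gamma, g_d, V):
    """Closed-form estimates of (phi_R, phi_S, Psi) plus the beamformers used."""
    N, T = V.shape
    g = g_d.reshape(-1, 1)
    Gi = np.linalg.inv(gamma)
    I = np.eye(N)
    # Oblique projections that null the noise basis V and the RDTF g_d, respectively.
    P_V = I - V @ np.linalg.solve(V.conj().T @ Gi @ V, V.conj().T @ Gi)
    P_g = I - (g @ g.conj().T @ Gi) / np.real(g.conj().T @ Gi @ g)
    # MVDR towards the speaker with nulls on the noise sources (1 x N row vector).
    ws_H = (g.conj().T @ Gi @ P_V) / (g.conj().T @ Gi @ P_V @ g)
    # Multi-source LCMV towards the noise sources with a null on the speaker (T x N).
    Wu_H = np.linalg.solve(V.conj().T @ Gi @ P_g @ V, V.conj().T @ Gi @ P_g)
    Q = I - g @ ws_H - V @ Wu_H                   # blocks both speech and noise
    phi_r = np.real(np.trace(Q @ R_y @ Gi)) / (N - T - 1)
    D = R_y - phi_r * gamma                       # reverberation-compensated covariance
    phi_s = np.real((ws_H @ D @ ws_H.conj().T)[0, 0])
    psi = Wu_H @ D @ Wu_H.conj().T
    return phi_r, phi_s, psi, ws_H, Wu_H, Q
```

Passing the exact covariance $\bm{\Phi}_{\mathbf{y}}$ instead of a sample covariance should return the true PSDs, and the distortionless and null constraints can be checked directly on `ws_H` and `Wu_H`.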
In the noiseless case, i.e. when , reduces to
[TABLE]
where , leading to the same closed-form estimators as in [3, Eq. (7)]:
[TABLE]
III-B Blocking-Based Estimation
As a second approach, we first block the speech component using a BM, and then jointly estimate the PSDs of the reverberation and noise. Let denote the BM, which satisfies . The output of the BM is given by
[TABLE]
The PDF of therefore writes:
[TABLE]
with the PSD matrix given by
[TABLE]
where is the total interference matrix, defined in (16). Multiplying (15) from left by and from right by , the noise PSD matrix at the output of the BM writes
[TABLE]
where is the reduced noise subspace:
[TABLE]
Substituting (16) into (39) and using (41), yields
[TABLE]
Under this model, the parameter set of interest is . Let be defined similarly to in (17), as a concatenation of i.i.d. consecutive frames. Similarly to , it is assumed that is fixed during the entire segment. The PDF of therefore writes
[TABLE]
where is given by
[TABLE]
The MLE of is obtained by solving:
[TABLE]
To the best of our knowledge, this problem is also considered as having no closed-form solution. Again, we argue that if the noise PSD matrix satisfies , then we can obtain a closed-form solution. In Appendix E, the following MLE is obtained:
[TABLE]
where is given by
[TABLE]
After the BM was applied, the remaining role of is to block the noise signals, in order to estimate the reverberation level. Note that .
Given , it is shown in Appendix F that the MLE for the noise PSD writes
[TABLE]
where is a multi-source LCMV beamformer, directed towards the noise signals after the BM, given by
[TABLE]
Note that with this notation, in (47) can be recast as
[TABLE]
Since , it also follows that
[TABLE]
Also, in Appendix G it is shown that
[TABLE]
namely the LCMV of (29), used in the non-blocking-based approach, can be factorized into two stages: The first is a BM that blocks the speech signal, followed by a modified LCMV, which recovers the noise signals at the output of the BM.
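The two-stage structure can be checked numerically. The sketch below assumes a blocking matrix with $N-1$ orthonormal columns spanning the orthogonal complement of $\mathbf{g}_d$ (obtained here from a complete QR factorization, one of many valid choices), and it assumes the blocking-based estimator has the same closed form as the non-blocking one, applied to the blocked quantities $\mathbf{B}^{\textrm{H}}\mathbf{R}_{\mathbf{y}}\mathbf{B}$, $\mathbf{B}^{\textrm{H}}\bm{\Gamma}_R\mathbf{B}$ and $\mathbf{B}^{\textrm{H}}\mathbf{V}$ with normalization $N-1-T$. Function names are hypothetical.

```python
import numpy as np

def nonblocking_reverb(R_y, gamma, g_d, V):
    """Reverberation PSD estimate using the joint speech-plus-noise projection."""
    N, T = V.shape
    A = np.column_stack([g_d, V])
    Gi = np.linalg.inv(gamma)
    Q = np.eye(N) - A @ np.linalg.solve(A.conj().T @ Gi @ A, A.conj().T @ Gi)
    return np.real(np.trace(Q @ R_y @ Gi)) / (N - T - 1)

def blocking_reverb(R_y, gamma, g_d, V):
    """Block the speech first, then estimate the reverberation PSD from the BM output."""
    N, T = V.shape
    Q_full, _ = np.linalg.qr(g_d.reshape(-1, 1), mode="complete")
    B = Q_full[:, 1:]                             # N x (N-1), satisfies B^H g_d = 0
    R_z = B.conj().T @ R_y @ B                    # sample covariance after the BM
    G_z = B.conj().T @ gamma @ B                  # coherence after the BM
    V_z = B.conj().T @ V                          # reduced noise subspace
    Gzi = np.linalg.inv(G_z)
    Wt_H = np.linalg.solve(V_z.conj().T @ Gzi @ V_z, V_z.conj().T @ Gzi)
    Q_z = np.eye(N - 1) - V_z @ Wt_H              # blocks the remaining noise
    phi_r = np.real(np.trace(Q_z @ R_z @ Gzi)) / (N - 1 - T)
    return phi_r, B, Wt_H
```

For an arbitrary sample covariance the two functions are expected to return the same value, in line with the equivalence discussed in the comparison of the MLEs.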
III-C Comparing the MLEs
In this section, the obtained blocking-based and non-blocking-based MLEs are compared. We will use the following identity, that is proved in [14, Appendix A]:
[TABLE]
Substituting (30) into (53) yields
[TABLE]
III-C1 Comparing the reverberation PSD estimators
First, we compare the reverberation PSD estimators in (21) and (46). Substituting (50) into (46) and then using (41), (44), (52) and (53), yields the following equation:
[TABLE]
Using (32) and noting that , yields (21). It follows that both estimators are identical:
[TABLE]
It should be noted that in [12, 14] the two MLEs of the reverberation PSD were shown to be identical in the noiseless case. Here we extend this result to the noisy case, when the noise PSD matrix is a rank-deficient matrix.
III-C2 Comparing the noise PSD estimators
The noise PSD estimators in (28) and (48) are now compared. Substituting (44) into (48) and then using (52) and (56), yields the same expression as in (28), and therefore
[TABLE]
III-D MSE Calculation
In the sequel, the theoretical performance of the proposed PSD estimators is analyzed. Since the non-blocking-based and the blocking-based MLEs were proved in section III-C to be identical for both reverberation and noise PSDs, it suffices to analyze the non-blocking-based MLEs.
III-D1 Theoretical performance of the reverberation PSD estimators
It is well known that for an unbiased estimator, the MSE is identical to the variance. We therefore start by showing that the non-blocking-based MLE in (21) is unbiased. Using (19), the expectation of (21) writes
[TABLE]
Then, we use the following property (see (84d)):
[TABLE]
to obtain
[TABLE]
It follows that the reverberation MLE is unbiased, and thus the MSE is identical to the variance. Using the i.i.d. assumption, the variance of the non-blocking-based MLE in (21) is given by
[TABLE]
In order to simplify (61), the following identity is used. For a positive definite Hermitian form , where and a Hermitian matrix, the variance is given by [11, p. 513, Eq. (15.30)]:
[TABLE]
Since $\mathbf{y}(m)\sim\mathcal{N}_{c}\big(\bm{0},\bm{\Phi}_{\mathbf{y}}(m)\big)$ and is a Hermitian matrix (note that ), we obtain
[TABLE]
Finally, using (59) and (84b), the variance writes
[TABLE]
Note that in the noiseless case, namely , the variance reduces to the one derived in [12, 13].
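The variance expression can be checked with a small Monte-Carlo experiment. The sketch below assumes the reverberation estimator $\textrm{Tr}[\mathbf{Q}\mathbf{R}_{\mathbf{y}}\bm{\Gamma}_R^{-1}]/(N-T-1)$ and compares its empirical variance with $\phi_R^2/\big(L(N-T-1)\big)$, the value obtained by applying the Hermitian-form variance identity to each of the $L$ i.i.d. snapshots. All parameter values are hypothetical and are not the nominal values of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, L, trials = 6, 2, 20, 4000                 # hypothetical settings
phi_r, phi_s = 0.7, 1.5
idx = np.arange(N)
gamma = np.sinc(0.8 * np.abs(idx[:, None] - idx[None, :]))   # diffuse-field stand-in
g_d = np.exp(-1j * 0.8 * idx)                                # placeholder RDTF
V = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))[0]
psi = np.diag([2.0, 0.5]).astype(complex)
phi_y = phi_s * np.outer(g_d, g_d.conj()) + phi_r * gamma + V @ psi @ V.conj().T

A = np.column_stack([g_d, V])
Gi = np.linalg.inv(gamma)
Q = np.eye(N) - A @ np.linalg.solve(A.conj().T @ Gi @ A, A.conj().T @ Gi)
M = Gi @ Q                                       # Tr[Q R Gamma^{-1}] = Tr[M R]

chol = np.linalg.cholesky(phi_y)
est = np.empty(trials)
for t in range(trials):
    # Unit-variance circular complex Gaussian noise, coloured to covariance phi_y.
    W = (rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))) / np.sqrt(2.0)
    Y = chol @ W
    R_y = Y @ Y.conj().T / L                     # sample covariance over L snapshots
    est[t] = np.real(np.trace(M @ R_y)) / (N - T - 1)

emp_mean, emp_var = est.mean(), est.var()
theo_var = phi_r**2 / (L * (N - T - 1))
print(emp_mean, emp_var, theo_var)
```

The empirical mean should sit close to the true $\phi_R$ and the empirical variance close to the theoretical value, up to Monte-Carlo fluctuations that shrink as the number of trials grows.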
III-D2 Theoretical performance of the noise PSD estimators
Using (19), (II-B) and (31), and based on the unbiasedness of , it can be shown that in (28) is an unbiased estimator of .
Next, we calculate the variance of the diagonal terms of . To this end, we write the entry of in (28) as
[TABLE]
for , where is the column of the matrix in (29). Using a partitioned matrix to simplify , it can be shown that
[TABLE]
where is composed of all the vectors in except , i.e. , and is the corresponding projection matrix onto the subspace orthogonal to
[TABLE]
It can be verified that . Denote the diagonal terms of by . In Appendix H, it is shown that
[TABLE]
where is defined as the noise-to-reverberation ratio at the output of :
[TABLE]
Note that
[TABLE]
III-D3 Theoretical performance of the speech PSD estimator
Using (19), (II-B) and (27) and based on the unbiasedness of , it can be shown that is an unbiased estimator of . In a similar manner to (III-D2), the variance of (24) can be shown to be
[TABLE]
where is defined as the signal-to-reverberation ratio at the output of :
[TABLE]
Note that
[TABLE]
In the noiseless case, i.e. , (III-D3) becomes identical to the variance derived in [12, 13].
IV CRB Derivation
In this section, we derive the CRB on the variance of any unbiased estimator of the various PSDs.
IV-A CRB for the Late Reverberation PSD
In Appendix I, it is shown that the CRB on the reverberation PSD writes
[TABLE]
The resulting CRB is identical to the MSE derived in (64), and thus the proposed MLE is an efficient estimator.
IV-B CRB for the Speech and Noise PSDs
The CRB on the speech PSD is identical to the MSE derived in (III-D3), as outlined in Appendix J. The CRB on the noise PSD can be derived similarly. We conclude that the proposed PSD estimators are efficient.
V Experimental Study
In this section, the proposed MLEs are evaluated in a synthetic Monte-Carlo simulation as well as on measurements of a real room environment. In Section V-A, a Monte-Carlo simulation is conducted in which signals are generated synthetically based on the assumed statistical model. The sensitivity of the proposed MLEs is examined with respect to the various model parameters, and the MSEs of the proposed MLEs are compared to the corresponding CRBs. In Section V-B, the proposed estimators are examined in a real room environment, by utilizing them for the task of speech dereverberation and noise reduction using the MCWF.
V-A Monte-Carlo Simulation
V-A1 Simulation Setup
In order to evaluate the accuracy of the proposed estimators, synthetic data was generated according to the signal model in (1), by simulating i.i.d. snapshots of single-tone signals, having a frequency of Hz. The signals are captured by a uniform linear array (ULA) with microphones, and inter-distance between adjacent microphones. The desired signal component was drawn according to a complex Gaussian distribution with a zero-mean and a PSD . The RDTF is given by
[TABLE]
where is the time difference of arrival (TDOA) w.r.t. the reference microphone, given by , and is the DOA, defined as the broadside angle measured w.r.t. the perpendicular to the array. The reverberation component was drawn according to a complex Gaussian distribution with a zero-mean and a PSD matrix , where is modelled as an ideal spherical diffuse sound field, given by . The noise component was constructed as where denotes the noise sources, drawn according to a zero-mean complex Gaussian distribution with a random PSD matrix , and is an random ATF matrix. For the estimation procedure, is extracted by applying the EVD to a set of noisy training samples, generated with different .
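For a uniform linear array, the RDTF used in such a simulation can be sketched as a far-field steering vector. The sketch assumes the TDOA between adjacent microphones is $\tau=d\sin(\theta)/c$ for a broadside angle $\theta$ and that the phase sign follows the $e^{-j2\pi f\tau}$ convention. The function name and default sound velocity are hypothetical.

```python
import numpy as np

def ula_rdtf(n_mics, spacing_m, freq_hz, doa_deg, c=343.0):
    """Far-field RDTF of a ULA relative to the first (reference) microphone."""
    tau = spacing_m * np.sin(np.deg2rad(doa_deg)) / c       # TDOA between adjacent mics
    return np.exp(-2j * np.pi * freq_hz * tau * np.arange(n_mics))
```

A source at broadside ($\theta=0$) gives an all-ones vector, and the reference entry is always one, consistent with a relative transfer function.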
In the sequel, we examine the proposed estimators and bounds as a function of the model parameters. Specifically, the influence of the following parameters is examined: i) number of snapshots ; ii) reverberation PSD value ; iii) speech PSD value ; and iv) noise power , which is defined as the Frobenius norm of the noise PSD matrix, i.e. . In each experiment, we changed the value of one parameter, while keeping the rest fixed. The nominal values of the parameters are presented in Table II.
For each scenario, we carried out Monte-Carlo trials. The reverberation PSD was estimated in each trial with both (21) and (46), the noise PSD was estimated with both (28) and (48) and the speech PSD was estimated with (24). The accuracy of the estimators was evaluated using the normalized mean square error (nMSE), by averaging over the Monte-Carlo trials and normalizing w.r.t. the square of the corresponding PSD value. For each quantity, the corresponding normalized CRB was also computed, in order to demonstrate the theoretical lower bound on the nMSE.
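The nMSE computation described above can be sketched as follows. The scalar experiment below is only a toy stand-in, assuming a single channel whose PSD is estimated by averaging the squared magnitudes of i.i.d. snapshots; it is not the multichannel estimators of the paper. The function names and the numeric values used in the usage note are hypothetical.

```python
import random


def nmse(estimates, true_value):
    """MSE averaged over the trials, normalized by the squared true PSD value."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return mse / true_value ** 2


def toy_psd_trials(true_psd, num_snapshots, num_trials, seed=0):
    """Monte-Carlo trials of a sample-average PSD estimate (toy scalar model)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_trials):
        # |x|^2 of a zero-mean circular complex Gaussian variable with
        # variance true_psd is exponentially distributed with mean true_psd.
        powers = [rng.expovariate(1.0 / true_psd) for _ in range(num_snapshots)]
        estimates.append(sum(powers) / num_snapshots)
    return estimates
```

In this toy model the nMSE should be close to the reciprocal of the number of snapshots, e.g. `nmse(toy_psd_trials(2.0, 50, 2000), 2.0)` should land near 0.02, which mirrors the kind of consistency check against a bound that the simulation performs.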
V-A2 Simulation Results
In Fig. 1(a), the nMSEs are presented as a function of the number of snapshots, . Clearly, the nMSEs of all the estimators decrease as the number of snapshots increases. As expected from the analytical study, it is evident that the non-blocking-based and the blocking-based MLEs yield the same nMSE, for both the reverberation and noise PSDs. Furthermore, for all quantities the nMSEs coincide with the corresponding CRBs. It should be stressed that in practical applications of speech processing, the PSDs are highly time-varying, and thus will not remain constant over a large number of snapshots. This experiment only serves as a consistency check for the proposed estimators.
We now study the effect of varying the reverberation level. Let the signal-to-reverberation ratio (SRR) be defined as . In this experiment, we change s.t. the resulting SRR ranges between dB and dB. In Fig. 1(b), the nMSEs are presented as a function of SRR. It is evident that the nMSEs of the reverberation PSD estimators are independent of , and equal to , as manifested by (64). On the other hand, the nMSEs of the noise and speech MLEs decrease as the reverberation level decreases, until reaching the limiting value of , which is in line with (70) and (73), respectively.
We now examine the effect of changing the speech PSD level. Let the signal-to-reverberation-plus-noise ratio (SRNR) be defined as . We set to several values s.t. the SRNR ranges between dB and dB. In Fig. 1(c), the nMSEs are presented as a function of SRNR. It is shown that the speech PSD estimator is improved as the speech level increases. For the reverberation and noise PSD estimators, the performance is independent of . Obviously, the blocking-based estimators are not affected by the value of the speech PSD. It is further shown that the non-blocking-based reverberation and noise estimators also produce nMSEs that are independent of , as implied by (64) and (III-D2). This behaviour can be explained as follows. Both the reverberation and the noise PSD estimators apply a projection matrix onto the subspace orthogonal to the speech subspace, i.e. eliminate the speech component by generating a null towards its direction, as manifested by (22) and (29), respectively. As a result, they are not affected by any change in the speech level.
The effect of increasing the noise level is now examined. Let the signal-to-noise ratio (SNR) be defined as . We change s.t. the SNR varies between dB and dB. In Fig. 1(d), the nMSEs are presented as a function of SNR. It is evident that the performance of the noise PSD estimators degrades as decreases. In contrast, the nMSEs of the reverberation and speech MLEs are independent of the noise level, as manifested by (64) and (III-D3), respectively. This is due to the fact that the reverberation and speech MLEs eliminate the noise components by directing a null towards the noise subspace, as implied by (22) and (25), respectively.
V-B Experiments with Recorded Room Impulse Responses
The performance of the proposed PSD estimators is now evaluated in a realistic acoustic scenario, for the task of speech dereverberation and noise reduction. In our experiments, microphone signals were synthesized using real speech signals and measured RIRs. The proposed PSD estimators were used in order to calculate the MCWF.
V-B1 Competing Algorithms
The proposed method is compared to [15], in which the MCWF is implemented using the blocking-based or the non-blocking-based LS estimators. Therein, a spatially homogeneous noise sound field is assumed, namely , where is a known time-invariant spatial coherence matrix, and denotes the unknown time-varying PSD, which has to be estimated, along with the speech and reverberation PSDs. Although this method considers a different noise model, it is chosen as the baseline since it is the only work that also estimates the noise PSD.
V-B2 Implementation of the MCWF
It is well-known that the MCWF can be decomposed into an MVDR beamformer followed by a single-channel Wiener postfilter [22, 23]:
[TABLE]
where
[TABLE]
denotes the SRNR at the output of the MVDR, and is the residual interference at the MVDR output.
The implementation of (77) requires the estimate of the speech PSD, which is missing in the blocking-based framework. By substituting the obtained blocking-based reverberation and noise estimates, namely and , into the general likelihood function in (III-A), the maximization becomes a one-dimensional optimization problem, and a closed-form solution is available [24, 9]:
[TABLE]
However, it was shown in [15] that rather than using (77), better dereverberation performance is obtained by using the decision-directed approach [25], where is estimated by
[TABLE]
where is a weighting factor, and is an instantaneous estimate based on the MVDR output [8]:
[TABLE]
In our experiments, the MCWF was implemented with the two variants of computing : i) The direct implementation in (77), which will be referred to as Dir; and ii) the decision-directed implementation in (79) with the speech PSD estimated as in (80), denoted henceforth as DD.
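The DD variant can be sketched per time-frequency bin as follows. This is a minimal illustration, assuming the common decision-directed form in which the SRNR is a weighted sum of the previous-frame estimate and an instantaneous estimate, and assuming that the instantaneous speech PSD is obtained by power subtraction at the MVDR output. The exact expressions of (79) and (80) are not reproduced in this text, so the function names, argument names and formulas below are assumptions rather than the authors' implementation.

```python
def instantaneous_speech_psd(mvdr_output_power, interference_psd):
    """Power subtraction at the MVDR output, floored at zero (assumed form)."""
    return max(mvdr_output_power - interference_psd, 0.0)


def decision_directed_srnr(prev_output_power, prev_interference_psd,
                           inst_speech_psd, interference_psd, beta):
    """Blend the previous-frame SRNR estimate with the instantaneous one.

    beta is the weighting factor; a value close to 1 gives a smoother estimate.
    """
    previous = prev_output_power / prev_interference_psd
    instantaneous = inst_speech_psd / interference_psd
    return beta * previous + (1.0 - beta) * instantaneous


def wiener_gain(srnr, min_gain_db):
    """Single-channel Wiener postfilter gain, lower bounded by min_gain_db."""
    floor = 10.0 ** (min_gain_db / 20.0)
    return max(srnr / (1.0 + srnr), floor)
```

The lower bound on the gain limits the maximum attenuation, which trades residual interference for fewer musical-noise artifacts.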
V-B3 Performance Measures
The speech quality was evaluated in terms of three common objective measures, namely perceptual evaluation of speech quality (PESQ) [26], frequency-weighted segmental SNR (fwSNRseg) [27] and log-likelihood ratio (LLR) [27]. The measures were computed by comparing with the direct speech signal as received by the reference microphone, obtained by filtering the anechoic speech with the direct path. In the sequel, we present the improvement in these performance measures, namely , and , computed as the measure difference between the output signal (i.e. ) and the noisy and reverberant signal at the reference microphone (i.e. ).
V-B4 Experimental Setup
We used RIRs from the database presented in [28]. The impulse responses were recorded in the m acoustic lab of the Engineering Faculty at Bar-Ilan University (BIU). The reverberation time was set to msec. The RIRs were recorded by a ULA of microphones with inter-distances of cm. A loudspeaker was positioned at a distance of m from the array, at angles , as illustrated in Fig. 2. The performance is evaluated on experiments with different speaker angles, which were randomly selected from the given set of angles. The chosen RIRs were convolved with 3-5 sec long utterances of male and female speakers that were drawn from the TIMIT database [29].
For the additive noise, we used non-stationary directional noise signals. Two non-stationary noise sources (with time-varying speech-like spectrum) were positioned at a distance of m from the array. In each experiment, the angles of the noise sources were randomly selected, avoiding the angle of the speaker. The microphone signals were synthesized by adding the noises to the reverberant speech signals with several reverberant signal-to-noise ratio (RSNR) levels. Finally, a sec noise segment was prepended to the reverberant speech signal. Based on this segment, a sample noise covariance matrix was computed, and then the EVD was applied in order to extract the spatial basis that replaces the noise ATF matrix (see (14)).
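The basis-extraction step can be sketched for a single frequency bin as follows. This is a minimal sketch, assuming that the noise-only frames are stacked as the columns of a complex matrix and that the number of noise sources is known; the function name and array layout are hypothetical, and NumPy is used for the eigenvalue decomposition.

```python
import numpy as np


def noise_subspace_basis(noise_frames, num_sources):
    """Return an orthonormal basis of the dominant noise subspace.

    noise_frames: complex array of shape (num_mics, num_frames), noise only.
    num_sources:  assumed number of directional noise sources (< num_mics).
    """
    num_frames = noise_frames.shape[1]
    # Sample noise covariance matrix over the noise-only segment
    sample_cov = noise_frames @ noise_frames.conj().T / num_frames
    # eigh returns the eigenvalues of a Hermitian matrix in ascending order
    _, eigvecs = np.linalg.eigh(sample_cov)
    # Keep the eigenvectors that belong to the largest eigenvalues
    return eigvecs[:, -num_sources:]
```

The returned basis spans the same subspace as the columns of the noise ATF matrix only in the ideal rank-deficient case; with sensor noise or estimation errors it is an approximation of that subspace.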
The following values of the parameters were used. The sampling rate was set to kHz, the STFT was computed with windows of msec and 75% overlap between adjacent time frames. As the experiment consists of real-life non-stationary speech signals, the sample covariance matrix was estimated using recursive averaging [9] with a smoothing factor of , rather than the moving-window averaging of (19). The same applies for . The smoothing parameter for the decision-directed approach in (79) was set to . The gain of the single-channel postfilter was lower bounded by dB.
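The recursive averaging of the sample covariance matrix can be sketched as follows. This is a minimal illustration, assuming the usual first-order recursion in which the previous estimate is weighted by the smoothing factor and the outer product of the current frame by its complement; the function name and the list-of-lists representation are hypothetical.

```python
def update_covariance(prev_cov, frame, alpha):
    """One recursive-averaging step for a single frequency bin.

    Computes alpha * prev_cov + (1 - alpha) * y y^H, where y is the current
    multichannel STFT frame and alpha is the smoothing factor in [0, 1).
    """
    n = len(frame)
    return [[alpha * prev_cov[i][j]
             + (1.0 - alpha) * frame[i] * frame[j].conjugate()
             for j in range(n)]
            for i in range(n)]
```

A smoothing factor closer to 1 averages over more frames, which lowers the variance of the estimate but slows the tracking of non-stationary signals.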
V-B5 Experimental Results
First, we examine the performance obtained with various RSNR levels. Tables III and IV summarize the results for both reverberation levels. Note that negative indicates a performance improvement. The best results are highlighted in boldface. It is evident that the proposed ML methods outperform the baseline methods of [15] for all the considered scenarios. The advantage of the proposed method can be attributed to the fact that the full noise PSD matrix is estimated in each frame, and thus the non-stationarity of all the noise sources can be tracked, whereas the baseline model of a time-invariant spatial coherence matrix with a single controllable gain parameter cannot track these dynamics.
In order to analyze the behaviour of the various ML implementations, recall that the non-blocking-based and the blocking-based noise and reverberation MLEs were proved to be identical in Section III-C. Since both the non-blocking ML DD and the blocking ML DD also use the same speech PSD estimate in (80), it follows that both methods yield the same performance. However, the non-blocking ML Dir and the blocking ML Dir implementations differ, due to the fact that they use different speech PSD estimators in (24) and (V-B2), respectively.
From Tables III and IV, it is evident that the non-blocking-based ML Dir method is preferable for its high and stable scores, as it obtains the best and scores in most cases, and also yields high results. The advantage of the non-blocking ML Dir over the blocking ML Dir can be attributed to the fact that the non-blocking ML Dir uses the speech PSD estimate in (24), which is independent of the estimated noise PSD. In contrast, the blocking-based ML Dir estimates the speech PSD with (V-B2), which depends on the estimated noise PSD, and thus can be affected by estimation errors of the noise PSD. Indeed, there is a slight disadvantage for the non-blocking ML Dir in terms of , which mainly quantifies the performance due to speech distortion. This can be related to the fact that the speech PSD estimate in (24) generates nulls towards the noise sources, which in some cases can affect speech distortion, for example when the speech and noise sources are not fully spatially separated.
Comparing the direct and the DD implementations of the non-blocking ML approach, it can be observed that there is a clear advantage for the direct implementation in most cases. This can be explained by the fact that the direct method optimally estimates the speech PSD by maximizing the ML criterion, while the decision-directed approach estimates the speech PSD using (79)–(80), which is derived in a more heuristic manner. However, in adverse conditions of high noise and reverberation levels, the DD has some advantage due to its use of a smoother estimator compared to the instantaneous estimator of the direct form, thus improving the robustness of the speech PSD estimate. To conclude, the non-blocking ML Dir is the preferable method among the proposed estimators.
Next, the performance is investigated as a function of the number of noise sources. A representative scenario with dB and was inspected. Since our model assumes that , the following values of were examined: . Figs. 3 and 4 depict the results for both reverberation levels, where NBB denotes the non-blocking-based method, and BB refers to the blocking-based method. It is shown that the proposed methods outperform the baseline methods [15] in most cases, which is in line with the results obtained in Tables III and IV, showing the advantage of the proposed rank-deficient noise model. The baseline methods have an advantage only for values of close to . This can be attributed to the fact that when the number of non-stationary noise sources increases, their sum tends to be stationary, especially in high reverberation conditions. In this case, it might be preferred to use the simpler approximation of the noise PSD matrix, namely a time-invariant spatial coherence matrix multiplied by a scalar time-varying PSD, rather than the proposed model that tracks the non-stationarity of each of the noises, and requires the estimation of a larger number of parameters.
Finally, we test the influence of the number of microphones on the performance. Figs. 5 and 6 depict the measures obtained with different numbers of microphones, i.e. , for both reverberation times, where and dB. It is evident that the proposed methods outperform the baseline methods in almost all cases.
Fig. 7 depicts several sonogram examples of the various signals at msec and RSNR of dB. Fig. 7(a) shows , the direct speech signal as received by the first microphone. Fig. 7(b) depicts , the noisy and reverberant signal at the first microphone. Figs. 7(c) and 7(d) show the MCWF output computed with (77), using the proposed blocking-based and non-blocking-based MLEs, respectively. We conclude that the application of the MCWF, implemented based on the proposed MLEs, significantly reduces noise and reverberation, while maintaining low speech distortion.
VI Conclusions
In this contribution, we discussed the problem of joint dereverberation and noise reduction, in the presence of directional noise sources, forming a rank-deficient noise PSD matrix. As opposed to state-of-the-art methods which assume the knowledge of the noise PSD matrix, we propose to also estimate the time-varying noise PSD matrix, assuming that a basis that spans the noise ATF subspace is known. MLEs of the reverberation, speech and noise PSDs are derived for both the non-blocking-based and the blocking-based methods. The resulting MLEs are in closed form and thus have low computational complexity. The proposed estimators are theoretically analyzed and compared. For both the reverberation and the noise PSD estimators, it is shown that the non-blocking-based MLE and the blocking-based MLE are identical. The estimators were shown to be unbiased, and the corresponding MSEs were calculated. Moreover, CRBs on the various PSDs were derived, and were shown to be identical to the MSEs of the proposed estimators. The discussion is supported by an experimental study based on both simulated data and real-life audio signals. It is shown that using the proposed estimators yields a large performance improvement with respect to a competing method.
Appendix A
The proof follows the lines of [24]. First, we use (15) and rewrite the PSD matrix of the microphone signals in (II-B) as
[TABLE]
where is defined in (23) and is given by
[TABLE]
Then, we define as an orthogonal projection matrix onto the speech-plus-noise subspace
[TABLE]
and is the orthogonal complement. We will make use of the following properties, which can be easily verified:
[TABLE]
Using [11, Eq. (15.47-15.48)], the derivative of the log of the likelihood function in (III-A) w.r.t. is given by
[TABLE]
Using (84a) and (84b) it follows that
[TABLE]
However, by substituting (83) into (A), it can be shown that the second term vanishes (follows from setting the derivative of the log likelihood w.r.t. in (B) to zero). Hence,
[TABLE]
where (a) follows from (84c) and (84d), and (b) follows from (84b) and (84c). Finally, using (22) we have . Thus, setting (A) to zero yields (21).
Appendix B
Using [24, Eq. (12)], the derivative of the log likelihood function $\log p\big(\bar{\mathbf{y}}(m);\bm{\phi}(m)\big)$ w.r.t. is given by
[TABLE]
Using (84e) and setting the result to zero yields
[TABLE]
In order to simplify the expression of (B), we define a partitioned matrix
[TABLE]
Using the formula of the inverse of a partitioned matrix and then taking the entry of , the MLE of in (24) is obtained.
Appendix C
We note that is the lower-right block of the full matrix in (82). Using again the formula for the inverse of the partitioned matrix in (90) and taking the corresponding entries, yields (28).
Appendix D
The proof follows by substituting the inverse of the partitioned matrix (90) into (22), and then using the definitions in (25) and (29).
Appendix E
The PSD matrix of the BM output is given in (42) as
[TABLE]
where is defined in (41). The proof is now similar to that of Appendix A, with the following changes: , , and are replaced with , , and , respectively.
Appendix F
Using (42), the MLE of can be calculated in a similar manner to (B), with
[TABLE]
where is given in (49).
Appendix G
Substituting (41) into (49), and then using (54), leads to
[TABLE]
Right multiplying (93) by , using again (54) and then comparing to (29), yields (52).
Appendix H
Using (19), (21) and the i.i.d. assumption, the variance of (65) can be recast as
[TABLE]
To proceed, we use (62), (59), and (33) to obtain
[TABLE]
Finally, using (II-B) and (31) yields (III-D2). The right-hand side of (69) is obtained by substituting (66) into the definition of , and noting that .
Appendix I
In this Appendix, the CRB on the reverberation PSD is derived. We will make use of the following identities [30]:
[TABLE]
Using the definitions of and in (81) and (82), we denote the set of unknown parameters by , where . As is a PSD matrix of a Gaussian vector, the Fisher information of each pair of parameters is given by [31]:
[TABLE]
where is the Fisher information of and and . In order to facilitate the derivation, we use (97) and vectorize (81) to obtain an vector:
[TABLE]
Using (96), (97), (99) and (100), the full Fisher information matrix (FIM) can be written as [32, 33]
[TABLE]
Next, we define the following partitioned matrix:
[TABLE]
Using (102) and (98), the FIM in (101) can be written as
[TABLE]
The CRB for is given by . Using the formula of the inverse of a partitioned matrix, the CRB can be written as
[TABLE]
where is a projection matrix onto the subspace orthogonal to :
[TABLE]
We now simplify (I). First, we use (100) along with (97) to write as
[TABLE]
Similarly, is simplified by using (100) along with (98),
[TABLE]
In order to calculate , the following identity is used [33]:
[TABLE]
Hence,
[TABLE]
where (a) follows by (I) and (I), (b) follows by (112) and (c) follows by (97). Left multiplying (I) by yields the reciprocal of the CRB in (I):
[TABLE]
where (a) follows by (I) and (b) follows by (96). To proceed, let us define as a matrix that spans the nullspace of the matrix s.t. , i.e. a blocking matrix of the speech-plus-noise signals. Since , it follows that [33]
[TABLE]
and thus
[TABLE]
where (a) and (b) follow since and (c) follows by (115). Substituting (I) into (I) and using the property , yields
[TABLE]
Substituting (117) into (I) yields (74).
Appendix J
We define the following partitioned matrix:
[TABLE]
where . Similarly to (107)–(I), we use (118) to construct the FIM, and then compute its inverse and take the corresponding component,
[TABLE]
We define a partition of as
[TABLE]
Using the blockwise formula for projection matrices [34],
[TABLE]
We note that can be further partitioned as
[TABLE]
In a similar manner to (121), we obtain
[TABLE]
Substituting (123) into (121) and then into (119), yields
[TABLE]
where , , , , , and , , . These quantities can be computed using similar techniques to those used in the derivation of (I)–(117). Due to space constraints, the detailed derivation is omitted. Collecting all the terms and substituting into (124), yields the same expression as the MSE in (III-D3).
- [1] U. Kjems and J. Jensen, “Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement,” in Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2012, pp. 295–299.
- [2] S. Braun and E. A. P. Habets, “Dereverberation in noisy environments using reference signals and a maximum likelihood estimator,” in Proceedings of the 21st European Signal Processing Conference (EUSIPCO), 2013, pp. 1–5.
- [3] A. Kuklasinski, S. Doclo, S. H. Jensen, and J. Jensen, “Maximum likelihood based multi-channel isotropic reverberation reduction for hearing aids,” in Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), 2014, pp. 61–65.
- [4] S. Braun and E. A. Habets, “A multichannel diffuse power estimator for dereverberation in the presence of multiple sources,” EURASIP J. Audio, Speech, Music Process., vol. 2015, no. 34, pp. 1–14, Dec. 2015.
- [5] O. Schwartz, S. Braun, S. Gannot, and E. A. Habets, “Maximum likelihood estimation of the late reverberant power spectral density in noisy environments,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2015, pp. 1–5.
- [6] O. Schwartz, S. Gannot, and E. A. Habets, “Joint maximum likelihood estimation of late reverberant and speech power spectral density in noisy environments,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 151–155.
- [7] A. Kuklasinski, S. Doclo, and J. Jensen, “Maximum likelihood PSD estimation for speech enhancement in reverberant and noisy conditions,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 599–603.
- [8] O. Schwartz, S. Gannot, and E. A. Habets, “Joint estimation of late reverberant and speech power spectral densities in noisy environments using Frobenius norm,” in 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 1123–1127.
