A unified convolutional beamformer for simultaneous denoising and dereverberation
Tomohiro Nakatani, Keisuke Kinoshita

TL;DR
This paper introduces a unified convolutional beamformer called WPD that optimally combines denoising and dereverberation, significantly enhancing speech quality and recognition accuracy.
Contribution
It presents a novel unified beamformer that integrates dereverberation and denoising into a single optimized framework, improving over traditional sequential methods.
Findings
Substantial improvement in speech enhancement metrics.
Enhanced automatic speech recognition performance.
Effective integration of dereverberation and denoising.
Table I: Objective speech enhancement measures

| Method | SimData CD | SimData SRMR | SimData FWSSNR | SimData SIIB | RealData SRMR |
|---|---|---|---|---|---|
| No Enh | 3.97 | 3.68 | 3.62 | 241.2 | 3.18 |
| WPE | 3.76 | 4.77 | 4.99 | 315.3 | 5.00 |
| MPDR | 3.67 | 4.50 | 4.66 | 312.4 | 4.82 |
| WPE+MPDR | 3.01 | 5.37 | 7.52 | 486.8 | 6.57 |
| Proposed | 2.64 | 5.34 | 8.18 | 521.7 | 6.64 |

Table II: Word error rates (WERs, %)

| Method | SimData Near | SimData Far | SimData Average | RealData Near | RealData Far | RealData Average |
|---|---|---|---|---|---|---|
| No Enh | 4.18 | 6.25 | 5.22 | 17.53 | 19.68 | 18.61 |
| WPE | 4.04 | 4.90 | 4.47 | 12.33 | 13.88 | 13.11 |
| MPDR | 3.81 | 4.65 | 4.23 | 10.60 | 13.81 | 12.20 |
| WPE+MPDR | 4.00 | 4.69 | 4.35 | 8.75 | 11.31 | 10.03 |
| Proposed | 3.60 | 3.95 | 3.78 | 7.86 | 10.67 | 9.27 |
Tomohiro Nakatani, Keisuke Kinoshita

T. Nakatani and K. Kinoshita are with NTT Communication Science Laboratories, NTT Corporation. Manuscript received December 19, 2018.
Abstract
This paper proposes a method for estimating a convolutional beamformer that can perform denoising and dereverberation simultaneously in an optimal way. The application of dereverberation based on a weighted prediction error (WPE) method followed by denoising based on a minimum variance distortionless response (MVDR) beamformer has conventionally been considered a promising approach; however, the optimality of this approach cannot be guaranteed. To realize the optimal integration of denoising and dereverberation, we present a method that unifies the WPE dereverberation method and a variant of the MVDR beamformer, namely a minimum power distortionless response (MPDR) beamformer, into a single convolutional beamformer, and we optimize it based on a single unified optimization criterion. The proposed beamformer is referred to as a Weighted Power minimization Distortionless response (WPD) beamformer. Experiments show that the proposed method substantially improves the speech enhancement performance in terms of both objective speech enhancement measures and automatic speech recognition (ASR) performance.
Index Terms:
Denoising, dereverberation, microphone array, speech enhancement, robust speech recognition
I Introduction
When a speech signal is captured by distant microphones, e.g., in a conference room, it will inevitably contain additive noise and reverberation components. These components are detrimental to the perceived quality of the observed speech signal and often cause serious degradation in many applications such as hands-free teleconferencing and automatic speech recognition (ASR).
Microphone array signal processing techniques have been developed to minimize the aforementioned detrimental effects by reducing the noise and the reverberation in the acquired signal. A filter-and-sum beamformer [1], a minimum-variance distortionless response (MVDR) beamformer and a minimum-power distortionless response (MPDR) beamformer [2, 3, 4, 5, 6], and a maximum signal-to-noise ratio beamformer [7, 8, 9] are widely-used systems for denoising, while a weighted prediction error (WPE) method and its variants [10, 11, 12, 13, 14] are emerging techniques for dereverberation. The usefulness of these techniques, particularly for improving ASR performance, has been extensively studied, e.g., at the REVERB challenge [15] and the CHiME-3/4/5 challenges [16, 17, 18]. Advances in this technological area have led to recent progress on commercial devices with far-field ASR capability, such as smart speakers [19, 20, 21].
However, it remains a challenge to reduce both noise and reverberation simultaneously in an optimal way. For example, researchers have proposed using MVDR beamforming and WPE dereverberation in a cascade manner [22, 23], where, for example, the signal is first processed by WPE dereverberation and then denoised with MVDR beamforming. With this approach, dereverberation may not be optimal due to the influence of the noise, and denoising may be disturbed by the remaining reverberation. Certain joint optimization techniques have also been proposed [24, 25, 26], but they perform dereverberation and denoising separately, which makes the optimality of the integration unclear, resulting in marginal performance improvement compared with the cascade system.
To achieve optimal integration, this paper proposes a method for unifying WPE dereverberation and MPDR beamforming into a single convolutional beamforming approach and for optimizing the beamformer based on a single unified optimization criterion. We can derive a closed-form solution for this beamformer, assuming that the time-varying power and steering vector of the desired signal are given. The optimality of the beamformer is guaranteed under the assumed optimization criterion and condition. The beamformer is referred to as a Weighted Power minimization Distortionless response (WPD) beamformer. Note that the signal power and the steering vector must also be given for WPE dereverberation and MPDR beamforming, respectively, and several techniques for their estimation have already been proposed [25, 27, 28].
In the experiments, we compare the proposed method with WPE dereverberation, MPDR beamforming, and both approaches in a cascade configuration in terms of objective speech enhancement measures and ASR performance. The experiments show that the proposed method substantially outperforms all the conventional methods with regard to almost all the performance metrics. For example, in comparison with the cascade system, the proposed method achieves an average word error reduction rate of 7.5 % for real data taken from the REVERB Challenge dataset.
II Signal model
Assume that a single speech signal is captured by $M$ microphones in a noisy reverberant environment. Then, the captured signal in the short-time Fourier transform (STFT) domain is approximately modeled at each frequency bin by

$$\mathbf{x}_t = \sum_{\tau=0}^{L_a-1} \mathbf{a}_\tau s_{t-\tau} + \mathbf{n}_t, \tag{1}$$

where $t$ and $\tau$ are time frame indices. Note that all the symbols should also have frequency bin indices, but they are omitted for brevity in this paper assuming that each frequency bin is processed independently in the same way. Letting $(\cdot)^{\top}$ denote the non-conjugate transpose, $\mathbf{x}_t = [x_t^{(1)}, \ldots, x_t^{(M)}]^{\top}$ is a column vector containing the STFT coefficients of the captured signals for all the microphones at a time frame $t$, $s_t$ is an STFT coefficient of the clean speech signal at a time frame $t$, $\mathbf{a}_\tau$ for $\tau = 0, \ldots, L_a-1$ is a sequence of column vectors containing convolutional acoustic transfer functions (ATFs) from the speaker location to all the microphones, $L_a$ is the length of the convolutional ATFs in each frequency bin, and $\mathbf{n}_t$ is the additive noise. As in eq. (1), according to [29], the effect of the reverberation can be approximately represented by the convolution in the STFT domain between $\mathbf{a}_\tau$ and $s_t$ when the length of the room impulse response in the time domain is longer than the analysis window. Hereafter, we refer to a sequence of STFT coefficients in each frequency bin, such as $\mathbf{x}_t$ and $s_t$ for $t = 1, \ldots, T$, simply as a signal.
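The first term in eq. (1) can be further decomposed into two parts, one composed of a direct signal and early reflections, hereafter referred to as the desired signal $\mathbf{d}_t$, and the other corresponding to the late reverberation $\mathbf{r}_t$ [30]. With this decomposition, eq. (1) is rewritten as

$$\mathbf{x}_t = \mathbf{d}_t + \mathbf{r}_t + \mathbf{n}_t, \tag{2}$$

$$\mathbf{d}_t = \sum_{\tau=0}^{b-1} \mathbf{a}_\tau s_{t-\tau}, \tag{3}$$

$$\mathbf{r}_t = \sum_{\tau=b}^{L_a-1} \mathbf{a}_\tau s_{t-\tau}, \tag{4}$$

where $b$ is the frame index that divides the convolutional ATFs into the ATF coefficients for $\mathbf{d}_t$ and those for $\mathbf{r}_t$. Later, $b$ is also termed the prediction delay for WPE dereverberation and WPD beamforming. Finally, we define the goal of realizing speech enhancement to preserve $\mathbf{d}_t$ while reducing $\mathbf{r}_t$ and $\mathbf{n}_t$ from $\mathbf{x}_t$.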
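As a rough illustration of the model in eqs. (1) and (2), the following NumPy sketch synthesizes one frequency bin and splits the convolution into an early part (lags below the prediction delay) and a late part. The function name `simulate_bin`, the array layout, and the white Gaussian noise are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def simulate_bin(s, atf, noise_std, b, rng):
    """Simulate one frequency bin of a noisy reverberant capture.

    s:         (T,) complex clean-speech STFT coefficients
    atf:       (L_a, M) complex convolutional ATF, one row per frame lag
    noise_std: standard deviation of the (assumed white) additive noise
    b:         lag index separating the early part (lags < b) from the late part
    Returns (x, d, r, n) with x = d + r + n, each of shape (T, M).
    """
    T = len(s)
    L_a, M = atf.shape
    d = np.zeros((T, M), dtype=complex)  # direct signal + early reflections
    r = np.zeros((T, M), dtype=complex)  # late reverberation
    for tau in range(L_a):
        target = d if tau < b else r
        # Add a_tau * s_{t - tau} for every frame t >= tau.
        target[tau:] += s[: T - tau, None] * atf[tau][None, :]
    n = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
    n *= noise_std / np.sqrt(2)
    return d + r + n, d, r, n
```

With this decomposition, an enhancement method can be scored against `d` (the desired signal) rather than against the clean speech itself.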
III Conventional methods
This section gives a brief overview of the conventional methods, including WPE dereverberation, MPDR beamforming, and two approaches with a cascade configuration.
III-A Dereverberation by WPE
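If we disregard the additive noise, $\mathbf{n}_t$, we can rewrite eq. (1) using a multichannel autoregressive model [31, 32, 10] as

$$\mathbf{x}_t = \sum_{\tau=b}^{L_w-1} \mathbf{W}_\tau^{\mathsf{H}} \mathbf{x}_{t-\tau} + \mathbf{d}_t, \tag{5}$$

where $L_w$ is the regression order, $(\cdot)^{\mathsf{H}}$ denotes the conjugate transpose, $\mathbf{W}_\tau$ for $\tau = b, \ldots, L_w-1$ are $M \times M$ dimensional matrices containing coefficients that predict the current captured signal, $\mathbf{x}_t$, from the past captured signals, $\mathbf{x}_{t-\tau}$ for $\tau = b, \ldots, L_w-1$, and the second term in the equation, referred to as the prediction error, is assumed to be the desired signal according to the model [10].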
WPE dereverberation estimates the prediction coefficients based on maximum likelihood estimation, assuming that the desired signal at each microphone follows a time-varying complex Gaussian distribution with a mean of zero and a time-varying variance, $\sigma_t^2$, which corresponds to the time-varying power of the desired signal. Then, the prediction coefficients, $\mathbf{W}_\tau$, are estimated as those that minimize the average power of the prediction error weighted by the inverse of $\sigma_t^2$. The estimation is represented by

$$\{\hat{\mathbf{W}}_\tau\} = \arg\min_{\{\mathbf{W}_\tau\}} \sum_t \frac{\left\| \mathbf{x}_t - \sum_{\tau=b}^{L_w-1} \mathbf{W}_\tau^{\mathsf{H}} \mathbf{x}_{t-\tau} \right\|^2}{\sigma_t^2}, \tag{6}$$

where $\|\mathbf{x}\|^2$ is the squared norm of a vector $\mathbf{x}$. It is known that the prediction delay $b$ also works as a distortionless constraint to prevent the desired signal components from being distorted by the dereverberation [10]. As for the estimation of $\sigma_t^2$, several useful techniques have been proposed including an iterative estimation method [29, 13].
With the estimated prediction coefficients, the dereverberation is performed by
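$$\hat{\mathbf{d}}_t = \mathbf{x}_t - \sum_{\tau=b}^{L_w-1} \hat{\mathbf{W}}_\tau^{\mathsf{H}} \mathbf{x}_{t-\tau}. \tag{7}$$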
It was experimentally confirmed that WPE dereverberation can function robustly even in noisy environments to reduce the late reverberation with a slight increase in the noise [10].
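The weighted least-squares estimation and the subtraction step described above can be sketched for a single frequency bin as follows. This is a minimal sketch, assuming the time-varying power is already given; the names `stack_past` and `wpe_bin`, the stacked coefficient matrix `G`, and the small regularizer `eps` are choices made for this example, not part of the paper.

```python
import numpy as np

def stack_past(x, b, L_w):
    """Stack the delayed frames x_{t-b}, ..., x_{t-L_w+1} into (T, M*(L_w-b))."""
    T, _ = x.shape
    cols = []
    for tau in range(b, L_w):
        delayed = np.zeros_like(x)
        delayed[tau:] = x[: T - tau]   # frames before t = tau stay zero
        cols.append(delayed)
    return np.concatenate(cols, axis=1)

def wpe_bin(x, power, b, L_w, eps=1e-8):
    """One WPE pass for a single frequency bin.

    x:     (T, M) complex captured signal
    power: (T,) estimate of the time-varying power of the desired signal
    Returns the dereverberated multichannel signal, shape (T, M).
    """
    X_past = stack_past(x, b, L_w)                 # (T, K)
    weight = 1.0 / np.maximum(power, eps)          # inverse-power weights
    # Weighted normal equations for the stacked prediction matrix G (K, M).
    R = (X_past.T * weight) @ X_past.conj()        # sum_t x~_t x~_t^H / power_t
    P = (X_past.T * weight) @ x.conj()             # sum_t x~_t x_t^H  / power_t
    G = np.linalg.solve(R + eps * np.eye(R.shape[0]), P)
    # Subtract the predicted late reverberation: x_t - G^H x~_t.
    return x - X_past @ G.conj()
```

In practice the power is unknown, so a pass like this is usually repeated, re-estimating the power from the previous output each time.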
III-B Beamforming by MPDR
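Assuming that the desired signal can be approximated as the product of a vector $\mathbf{v}$ with a clean speech signal, i.e., $\mathbf{d}_t = \mathbf{v} s_t$, and taking the late reverberation, $\mathbf{r}_t$, as part of the noise, $\mathbf{n}_t$, eq. (2) becomes

$$\mathbf{x}_t = \mathbf{v} s_t + \mathbf{n}_t. \tag{8}$$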
The MPDR beamformer is defined as a vector, $\mathbf{w}_0$, that minimizes the average power of the captured signal, $\mathbf{x}_t$, under a distortionless constraint, $\mathbf{w}_0^{\mathsf{H}} \mathbf{v} = 1$, that keeps the clean speech, $s_t$, unchanged by the beamforming [2, 3]. Here, $\mathbf{v}$ is also termed a steering vector, and techniques for its estimation from a captured signal have been proposed. Due to the scale ambiguity in the steering vector estimation, in practice it is substituted by a relative transfer function (RTF) [33]. An RTF is defined as the steering vector normalized by its value at a reference channel, calculated by $\tilde{\mathbf{v}} = \mathbf{v} / v_q$, where $v_q$ denotes the value of $\mathbf{v}$ at the reference channel $q$. This makes the distortionless constraint work to keep the desired signal at the reference channel, $d_t^{(q)}$, unchanged.

The beamformer is estimated as follows:

$$\hat{\mathbf{w}}_0 = \arg\min_{\mathbf{w}_0} \sum_t \left| \mathbf{w}_0^{\mathsf{H}} \mathbf{x}_t \right|^2 \quad \text{s.t.} \quad \mathbf{w}_0^{\mathsf{H}} \mathbf{v} = 1. \tag{9}$$
The desired signal is then estimated as
$$\hat{d}_t^{(q)} = \hat{\mathbf{w}}_0^{\mathsf{H}} \mathbf{x}_t. \tag{10}$$

With the beamformer, the resultant signal is composed of only one channel signal corresponding to the reference channel $q$.

On the basis of the above discussion, MPDR beamforming can perform both denoising and dereverberation [34] by reducing $\mathbf{n}_t$, which contains the additive noise and the late reverberation. However, its dereverberation capability is limited because it cannot reduce reverberation components that come from the target speaker direction, especially when there are few microphones.
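The constrained power minimization above has the well-known closed-form solution in which the inverse spatial covariance is applied to the RTF and the result is normalized. The sketch below shows it for one frequency bin; the function name `mpdr_bin` and the array layout are assumptions for this example.

```python
import numpy as np

def mpdr_bin(x, v, ref=0):
    """MPDR beamformer for a single frequency bin.

    x:   (T, M) complex captured signal
    v:   (M,) steering vector (any scale; it is normalized to an RTF here)
    ref: index of the reference channel
    Returns (w, d_hat): the beamformer (M,) and its single-channel output (T,).
    """
    rtf = v / v[ref]                      # relative transfer function
    R = x.T @ x.conj() / len(x)           # spatial covariance, mean of x_t x_t^H
    num = np.linalg.solve(R, rtf)         # R^{-1} rtf
    w = num / (rtf.conj() @ num)          # normalize so that w^H rtf = 1
    return w, x @ w.conj()                # output w^H x_t for every frame
```

Because the constraint is expressed with the RTF, the output approximates the desired signal as observed at channel `ref`, not the clean source signal.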
III-C Cascade of WPE dereverberation and MPDR beamforming
To achieve better speech enhancement in noisy reverberant environments, researchers have proposed using both WPE dereverberation and MPDR beamforming in a cascade configuration [22]. Because WPE dereverberation can dereverberate all the microphone signals individually, MPDR beamforming can be applied after WPE dereverberation has been applied. Techniques have also been proposed for estimating the steering vector and the power of the desired signal, for example, by iteratively and alternately applying WPE dereverberation and MPDR beamforming to the signals [25].
IV Proposed method
This section describes a method for unifying WPE dereverberation and MPDR beamforming into a single convolutional beamforming approach. A closed-form solution can be obtained for the beamformer given the steering vector and the time-varying power of the desired signal, and we can perform more effective speech enhancement than with a simple cascade consisting of WPE dereverberation and MPDR beamforming. Figure 1 illustrates the processing flow of the method.
IV-A Convolutional beamforming by WPD
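First, the signal obtained using the cascade consisting of WPE dereverberation and MPDR beamforming, i.e., eqs. (7) and (10), can be rewritten as

$$\begin{aligned}
\hat{d}_t^{(q)} &= \mathbf{w}_0^{\mathsf{H}} \left( \mathbf{x}_t - \sum_{\tau=b}^{L_w-1} \mathbf{W}_\tau^{\mathsf{H}} \mathbf{x}_{t-\tau} \right) && (11) \\
&= \mathbf{w}_0^{\mathsf{H}} \mathbf{x}_t + \sum_{\tau=b}^{L_w-1} \mathbf{w}_\tau^{\mathsf{H}} \mathbf{x}_{t-\tau} && (12) \\
&= \bar{\mathbf{w}}^{\mathsf{H}} \bar{\mathbf{x}}_t, && (13)
\end{aligned}$$

where we set $\mathbf{w}_\tau = -\mathbf{W}_\tau \mathbf{w}_0$ to obtain the second line above, and we set $\bar{\mathbf{w}} = [\mathbf{w}_0^{\top}, \mathbf{w}_b^{\top}, \ldots, \mathbf{w}_{L_w-1}^{\top}]^{\top}$ and $\bar{\mathbf{x}}_t = [\mathbf{x}_t^{\top}, \mathbf{x}_{t-b}^{\top}, \ldots, \mathbf{x}_{t-L_w+1}^{\top}]^{\top}$ to obtain the third line. Note that $\bar{\mathbf{x}}_t$ and $\bar{\mathbf{w}}$ contain a time gap between their first and the second elements, corresponding to the prediction delay $b$.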
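The rewriting of the cascade as a single convolutional filter can be checked numerically. In the sketch below, each prediction matrix is folded into the beamformer as a filter tap equal to the negated matrix-vector product, and the two outputs are compared. The helper names and the random test data are assumptions for this example.

```python
import numpy as np

def delayed(x, tau):
    """Return x delayed by tau frames (tau >= 1), zero-padded at the start."""
    out = np.zeros_like(x)
    out[tau:] = x[: len(x) - tau]
    return out

def cascade_output(x, W, w0, b):
    """Prediction subtraction with matrices W[k] (lag b + k), then beamformer w0."""
    derev = x.copy()
    for k, W_tau in enumerate(W):
        derev -= delayed(x, b + k) @ W_tau.conj()   # subtract W_tau^H x_{t-tau}
    return derev @ w0.conj()                        # apply w0^H

def conv_output(x, W, w0, b):
    """Same signal computed with one stacked convolutional filter."""
    w_bar = np.concatenate([w0] + [-(W_tau @ w0) for W_tau in W])
    x_bar = np.concatenate([x] + [delayed(x, b + k) for k in range(len(W))], axis=1)
    return x_bar @ w_bar.conj()

# Usage: both routes give the same single-channel output.
rng = np.random.default_rng(0)
cplx = lambda *shape: rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
x, W, w0 = cplx(50, 3), cplx(3, 3, 3), cplx(3)
print(np.allclose(cascade_output(x, W, w0, 2), conv_output(x, W, w0, 2)))
```

The equivalence holds for any fixed coefficients; what changes in the unified approach is that all taps of the stacked filter are optimized jointly instead of in two separate stages.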
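Next, the optimization criterion is defined based on the model of the desired speech used for WPE dereverberation, namely the time-varying Gaussian distribution, and based on the distortionless constraint used for MPDR beamforming. Specifically, we estimate the convolutional filter, $\bar{\mathbf{w}}$, as one that minimizes the average weighted power of a signal under a distortionless constraint. It is represented by

$$\hat{\bar{\mathbf{w}}} = \arg\min_{\bar{\mathbf{w}}} \sum_t \frac{\left| \bar{\mathbf{w}}^{\mathsf{H}} \bar{\mathbf{x}}_t \right|^2}{\sigma_t^2} \quad \text{s.t.} \quad \mathbf{w}_0^{\mathsf{H}} \mathbf{v} = 1. \tag{14}$$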
Here, all the filter coefficients are optimized based on the average weighted power minimization criterion. Note that the use of the time-varying weight makes the distribution of the enhanced speech obtained by beamforming closer to that of the desired speech.
Eq. (14) can be viewed as a variation of eq. (9), which is used for conventional MPDR beamforming. Unlike eq. (9), eq. (14) evaluates the average weighted power of the signal, and considers both the spatial and temporal covariance. The solution is obtained as follows:
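$$\hat{\bar{\mathbf{w}}} = \frac{\mathbf{R}^{-1} \bar{\mathbf{v}}}{\bar{\mathbf{v}}^{\mathsf{H}} \mathbf{R}^{-1} \bar{\mathbf{v}}}, \tag{15}$$

where $\bar{\mathbf{v}} = [\mathbf{v}^{\top}, 0, \ldots, 0]^{\top}$ is a column vector containing $\mathbf{v}$ followed by $M(L_w - b)$ zeros, and $\mathbf{R}$ is a power-normalized temporal-spatial covariance matrix with a prediction delay, which is defined as

$$\mathbf{R} = \sum_t \frac{\bar{\mathbf{x}}_t \bar{\mathbf{x}}_t^{\mathsf{H}}}{\sigma_t^2}. \tag{16}$$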
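Finally, with the estimated convolutional filter, $\hat{\bar{\mathbf{w}}}$, the target speech is estimated as

$$\hat{d}_t^{(q)} = \hat{\bar{\mathbf{w}}}^{\mathsf{H}} \bar{\mathbf{x}}_t. \tag{17}$$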
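The closed-form WPD solution can be sketched for one frequency bin as below, assuming the time-varying power and the steering vector are given. The names `stack_wpd` and `wpd_bin` and the small diagonal loading (added here only for numerical stability) are assumptions of this example, not part of the method as described.

```python
import numpy as np

def stack_wpd(x, b, L_w):
    """Stack x_t with x_{t-b}, ..., x_{t-L_w+1} into (T, M*(L_w-b+1))."""
    T, _ = x.shape
    blocks = [x]
    for tau in range(b, L_w):
        delayed = np.zeros_like(x)
        delayed[tau:] = x[: T - tau]
        blocks.append(delayed)
    return np.concatenate(blocks, axis=1)

def wpd_bin(x, power, v, b, L_w, ref=0, eps=1e-8):
    """WPD convolutional beamformer for a single frequency bin.

    x:     (T, M) complex captured signal
    power: (T,) time-varying power of the desired signal
    v:     (M,) steering vector (normalized to an RTF internally)
    Returns (w_bar, d_hat): stacked filter and single-channel output (T,).
    """
    X_bar = stack_wpd(x, b, L_w)
    weight = 1.0 / np.maximum(power, eps)
    # Power-normalized temporal-spatial covariance matrix.
    R = (X_bar.T * weight) @ X_bar.conj()
    R += eps * np.trace(R).real / len(R) * np.eye(len(R))  # assumed loading
    # Steering vector padded with zeros for the delayed taps.
    v_bar = np.zeros(X_bar.shape[1], dtype=complex)
    v_bar[: len(v)] = v / v[ref]
    num = np.linalg.solve(R, v_bar)
    w_bar = num / (v_bar.conj() @ num)
    return w_bar, X_bar @ w_bar.conj()
```

Since any MPDR-style filter padded with zeros satisfies the same constraint, the weighted output power of `w_bar` cannot exceed that of such a filter, up to the effect of the loading term.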
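Interestingly, the same solution can be derived for the proposed method even when we concatenate MPDR beamforming and WPE dereverberation in reverse order. The signal obtained in this case becomes

$$\hat{d}_t^{(q)} = \mathbf{w}_0^{\mathsf{H}} \mathbf{x}_t - \sum_{\tau=b}^{L_w-1} \mathbf{g}_\tau^{\mathsf{H}} \left( \mathbf{Q}^{\mathsf{H}} \mathbf{x}_{t-\tau} \right),$$

where $\mathbf{w}_0$ is the MPDR beamformer applied to $\mathbf{x}_t$, $\mathbf{Q}$ is an arbitrary denoising matrix that contains $\mathbf{w}_0$ in its first column, and $\mathbf{g}_\tau$ is a coefficient vector that predicts the current denoised signal, $\mathbf{w}_0^{\mathsf{H}} \mathbf{x}_t$, from the past denoised signals, $\mathbf{Q}^{\mathsf{H}} \mathbf{x}_{t-\tau}$. Then, eq. (12) is obtained by setting $\mathbf{w}_\tau = -\mathbf{Q} \mathbf{g}_\tau$, and optimized in the way discussed above.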
V Experiments
V-A Dataset and evaluation metrics
We evaluated the performance of the proposed method using the REVERB Challenge dataset [15]. The evaluation set (Eval set) of the dataset is composed of simulated data (SimData) and real recordings (RealData). Each utterance in the dataset contains reverberant speech uttered by a speaker and stationary additive noise. The distance between the speaker and the microphone array is varied from 0.5 m to 2.5 m. For SimData, the reverberation time is varied from about 0.25 s to 0.7 s, and the signal-to-noise ratio (SNR) is set at about 20 dB.
As objective measures for evaluating speech enhancement performance [35], we used the cepstrum distance (CD), the frequency-weighted segmental SNR (FWSSNR), the speech-to-reverberation modulation energy ratio (SRMR) [36], and the speech intelligibility in bits with the information capacity of a Gaussian channel (SIIB) [37]. SIIB is a recently proposed intrusive instrumental metric that is used to evaluate the intelligibility of distorted speech signals. To evaluate the enhanced speech in terms of ASR performance, we used a baseline ASR system recently developed using kaldi [38]. This is a fairly competitive system composed of a time-delay neural network acoustic model trained using a lattice-free maximum mutual information criterion and online i-vector extraction, and a tri-gram language model.
V-B Methods to be compared and analysis conditions
We compared WPD beamforming (Proposed) with WPE dereverberation, MPDR beamforming, and WPE dereverberation followed by MPDR beamforming (WPE+MPDR). For all the methods, a hanning window was used for a short time analysis with the frame length and the shift set at 32 ms and 8 ms, respectively. The sampling frequency was 16 kHz and microphones were used. For WPE dereverberation, WPE+MPDR, and WPD beamforming, the prediction delay was set at , and the order of the autoregressive model was set at , and , respectively, for frequency ranges of [math] to kHz, to kHz, to kHz, to kHz, and to kHz.
The time-varying power, $\sigma_t^2$, and the steering vector, $\mathbf{v}$, were estimated from the captured signal based on a method used in [25]. Figure 2 shows the processing flow. The same estimates were used for all the methods. Adopting the power of the captured signal as the initial value of $\sigma_t^2$, we repeatedly applied WPE+MPDR to the captured signal, and updated $\mathbf{v}$ and $\sigma_t^2$ using the outputs of the WPE dereverberation and MPDR beamforming, respectively. The number of iterations was set at two. The steering vector was estimated based on the generalized eigenvalue decomposition with covariance whitening [27, 28] assuming that each utterance has noise-only periods of 225 ms and 75 ms, respectively, at its beginning and ending parts.
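For readers unfamiliar with covariance whitening, one common form of it can be sketched as follows: the noise covariance from the noise-only frames is used to whiten the covariance of the whole signal, and the principal eigenvector is mapped back and normalized at the reference channel. This is a generic sketch under those assumptions and may differ in detail from the procedures in [27, 28]; the function name `estimate_rtf_cw` and the `noise_frames` index argument are hypothetical.

```python
import numpy as np

def estimate_rtf_cw(x, noise_frames, ref=0):
    """Estimate an RTF for one frequency bin by covariance whitening.

    x:            (T, M) complex signal (e.g. a dereverberated capture)
    noise_frames: indices of frames assumed to contain noise only
    """
    R_x = x.T @ x.conj() / len(x)                 # covariance of all frames
    n = x[noise_frames]
    R_n = n.T @ n.conj() / len(n)                 # noise covariance
    L = np.linalg.cholesky(R_n)                   # R_n = L L^H
    A = np.linalg.solve(L, R_x)                   # L^{-1} R_x
    R_white = np.linalg.solve(L, A.conj().T).conj().T   # L^{-1} R_x L^{-H}
    _, vecs = np.linalg.eigh(R_white)             # eigenvalues in ascending order
    v = L @ vecs[:, -1]                           # de-whiten principal eigenvector
    return v / v[ref]                             # normalize at reference channel
```

The number of noise-only frames must be at least the number of microphones for the Cholesky factorization to succeed, which is one reason the leading and trailing noise-only periods matter.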
V-C Evaluation with objective speech enhancement measures
Table I summarizes evaluation results obtained using objective speech enhancement measures. First, all the methods improved the speech quality with all the measures. In addition, WPE+MPDR greatly outperformed WPE dereverberation and MPDR beamforming, while the proposed method further outperformed WPE+MPDR for all the metrics except for SRMR on SimData. These results clearly show the superiority of WPD beamforming.
V-D Evaluation using ASR
Table II shows the word error rates (WERs) obtained using the baseline ASR system. The proposed method greatly outperformed all the other methods under all the conditions.
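The relative WER reduction over the cascade system can be recomputed from the averages in Table II. The helper below is only an arithmetic aid (its name is made up for this example); note that the rounded table entries give a value slightly different from the 7.5 % quoted in the Introduction, presumably because of rounding.

```python
def relative_wer_reduction(wer_baseline, wer_proposed):
    """Relative word error rate reduction in percent."""
    return 100.0 * (wer_baseline - wer_proposed) / wer_baseline

# RealData averages from Table II: WPE+MPDR 10.03 %, Proposed 9.27 %.
print(round(relative_wer_reduction(10.03, 9.27), 1))  # 7.6

# SimData averages from Table II: WPE+MPDR 4.35 %, Proposed 3.78 %.
print(round(relative_wer_reduction(4.35, 3.78), 1))   # 13.1
```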
Finally, it may be interesting to compare WPD beamforming roughly with the frontend of the best performing system [22] at the REVERB challenge. (The comparison is only rough because the analysis conditions used for the two methods, such as the length of the convolutional filter and the way of calculating the time-varying power $\sigma_t^2$ and the steering vector $\mathbf{v}$, are not the same.) The frontend was composed of WPE dereverberation and MVDR beamforming followed by a nonlinear denoising method, DOLPHIN [39]. With this frontend and the kaldi ASR baseline, the average WERs for RealData were 10.29 % and 9.07 % without and with DOLPHIN, respectively. In contrast, when we evaluated WPD beamforming without and with DOLPHIN, the WERs were 9.27 % and 8.91 %, respectively. This again indicates the superiority of WPD beamforming.
VI Concluding remarks
This paper presented a method for unifying WPE dereverberation and MPDR beamforming that made it possible to perform denoising and dereverberation both optimally and simultaneously based on microphone array signal processing. Convolutional beamforming by WPD was derived and shown to improve the speech enhancement performance in noisy reverberant environments, with regard to objective speech enhancement measures and WERs, in comparison with conventional methods, including WPE dereverberation, MPDR beamforming, and WPE+MPDR. Future work will include an evaluation of WPD beamforming in various environments, the introduction of different optimization criteria, and the extension of the proposed method to online processing.
References
- [1] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Trans. ASLP, vol. 15, no. 7, pp. 2011–2022, 2007.
- [2] H. L. V. Trees, Optimum Array Processing, Part IV of Detection, Estimation, and Modulation Theory, Wiley-Interscience, New York, 2002.
- [3] H. Cox, “Resolving power and sensitivity to mismatch of optimum array processors,” The Journal of the Acoustical Society of America, vol. 54, pp. 771–785, 1973.
- [4] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, “Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 780–793, 2017.
- [5] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” Proc. Interspeech, pp. 1981–1985, 2016.
- [6] S. Emura, S. Araki, T. Nakatani, and N. Harada, “Distortionless beamforming optimized with $l_1$-norm minimization,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 936–940, 2018.
- [7] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, 2007.
- [8] S. Araki, H. Sawada, and S. Makino, “Blind speech separation in a meeting situation with maximum SNR beamformer,” Proc. IEEE ICASSP, pp. 41–44, 2007.
