Frequency bin-wise single channel speech presence probability estimation using multiple DNNs
Shuai Tao, Himavanth Reddy, Jesper Rindom Jensen, Mads Græsbøll Christensen

TL;DR
This paper introduces a frequency bin-wise speech presence probability estimation method using multiple DNNs, reducing model complexity and improving detection accuracy over traditional methods.
Contribution
It proposes a novel frequency bin-wise approach with separate DNNs, lowering complexity and data requirements compared to conventional all-bin models.
Findings
Improved speech detection accuracy with the bin-wise model
Outperforms state-of-the-art SPP methods in accuracy
Reduces model complexity significantly
Table 1: Speech detection probability $P_d$ and AUC of the compared SPP estimators.

| Methods | $P_d$ | AUC |
|---|---|---|
| IMCRA [4] | 0.1183 | 0.6504 |
| Unbiased [5] | 0.3460 | 0.7348 |
| General [7] | 0.1132 | 0.6229 |
| Self-Attention [31] (1.1 hours) | 0.4617 | 0.8100 |
| Typical DNN-based (1.1 hours) | 0.4509 | 0.7993 |
| Typical DNN-based (16.6 hours) | 0.4652 | 0.8012 |
| Proposed (no neighboring bins) (1.1 hours) | 0.5012 | 0.7986 |
| Proposed (1 neighboring bin) (1.1 hours) | 0.5038 | 0.8011 |
| Proposed (2 neighboring bins) (1.1 hours) | 0.4891 | 0.7988 |

Table 2: Number of parameters and FLOPs of the DNN-based models.

| Methods | Parameters | FLOPs (MACs) |
|---|---|---|
| Self-Attention [31] | 867.12K | 85.6M |
| Typical DNN-based | 100.62K | 13.1M |
| Proposed (no neighboring bins) | 1548 | 2451 |
| Proposed (1 neighboring bin) | 2292 | 3188 |
| Proposed (2 neighboring bins) | 3024 | 3920 |
Abstract
In this work, we propose a frequency bin-wise method to estimate the single-channel speech presence probability (SPP) with multiple deep neural networks (DNNs) in the short-time Fourier transform domain. Since all frequency bins are typically considered simultaneously as input features for conventional DNN-based SPP estimators, high model complexity is inevitable. To reduce the model complexity and the requirements on the training data, we take a single frequency bin and some of its neighboring frequency bins into account to train separate gated recurrent units. In addition, the noisy speech and the a posteriori probability SPP representation are used to train our model. The experiments were performed on the Deep Noise Suppression challenge dataset. The experimental results show that the speech detection accuracy can be improved when we employ the frequency bin-wise model. Finally, we also demonstrate that our proposed method outperforms most of the state-of-the-art SPP estimation methods in terms of speech detection accuracy and model complexity.
Index Terms: frequency bin-wise, speech presence probability, a posteriori probability, gated recurrent units
1 Introduction
Noise estimation is one of the key components to realize single-channel and multi-channel speech enhancement, most of which rely on the speech presence probability (SPP) to update the noise statistics [1, 2, 3]. Available noise power spectral density (PSD) estimators also make use of the SPP to decide when to update the noise PSD [4, 5, 6]. Compared to voice activity detectors (VAD), SPP is a soft-decision approach that depends on the correlation of inter-bands and inter-frames [7]. Accurate SPP estimation can greatly improve the effectiveness of speech enhancement [8, 9].
In the short-time Fourier transform (STFT) domain, conventional statistical signal processing methods commonly assume that the spectral coefficients of speech and noise are independent and follow a complex Gaussian distribution [10, 11]. Therefore, the SPP can be derived from the a posteriori probability of the time-frequency (T-F) bins of the noisy speech. Based on this assumption, [4] applied the minimum values of a smoothed periodogram to estimate the SPP, which makes the SPP estimate more robust to non-stationary noise. In [5], to achieve a highly accurate SPP estimate with low latency and computational complexity, an optimal fixed a priori SNR was used to guarantee that the SPP is close to zero when speech is absent. In addition, [7] takes the inter-band and inter-frame correlation into account when designing a general SPP estimator.
Recently, deep neural networks (DNNs) have proven effective at handling non-stationary noise, and many DNN-based approaches have been proposed to estimate the SPP accurately, with successful applications to speech enhancement and speech recognition [12, 13, 14]. In these methods, recurrent neural networks (RNNs) [15] are commonly used to acquire information from neighboring frames, since the frames contain temporal information that can improve the accuracy of SPP estimation. In [14], a bidirectional long short-term memory (BLSTM) network was trained on input features comprising multiple time frames with all frequency bins to estimate the SPP. In [12], considering that the ideal ratio mask (IRM) [16] ranges from 0 to 1 at each T-F bin, the authors selected different DNN models, such as LSTM, BLSTM, gated recurrent units (GRUs), and bidirectional GRU (BGRU), to estimate the IRM and approximate the SPP. However, as the model complexity and the amount of training data grow, more powerful hardware is required to train the models.
Inspired by conventional SPP estimation methods, our model estimates the SPP based on the correlation of several neighboring T-F bins, in contrast to the typical DNN-based SPP estimation approach where all frequency bins are regarded as the input features. This allows us to use DNNs on a one-to-one basis with frequency bins, thereby vastly reducing the number of parameters in the model and the amount of computation. In this work, we thus propose a frequency bin-wise SPP estimation model in the STFT domain that relies on multiple DNNs to estimate the SPP. In our proposed model architecture, the GRU module is used to extract time and frequency information from each frequency bin and several of its neighbors. Additionally, since IRM-based SPP estimation methods may misclassify the T-F bins dominated by non-speech and noise [17, 18, 12], we choose the a posteriori probability to represent the SPP in the STFT domain.
The work is organized as follows. In Section 2, the problem of frequency bin-wise single-channel SPP estimation is formulated and the SPP estimation model with multiple DNNs is designed. In Section 3 and Section 4, the experimental procedures and results are provided, respectively. Finally, Section 5 presents the conclusion. The work can be found on GitHub (https://github.com/Shuaitaoaau/SPP).
2 Frequency Bin-Wise SPP Estimation
2.1 Signal Modeling
For the single-channel speech signal $x(n)$, we assume that it is corrupted by the additive noise $d(n)$. That is, in the STFT domain, we can obtain the noisy speech representation as follows:
$$Y(k,l) = X(k,l) + D(k,l), \tag{1}$$
where $k \in \{0, \dots, K-1\}$ denotes the frequency bin index and $K$ is the number of frequency bins, $l \in \{0, \dots, L-1\}$ denotes the time frame index and $L$ is the number of time frames. With the assumption of a zero-mean complex Gaussian distribution and independence for $X(k,l)$ and $D(k,l)$, we have
$$\phi_Y(k,l) = E\left[|Y(k,l)|^2\right] = \phi_X(k,l) + \phi_D(k,l), \tag{2}$$
where $E[\cdot]$ is the statistical expectation operator, $\phi_X(k,l) = E\left[|X(k,l)|^2\right]$ and $\phi_D(k,l) = E\left[|D(k,l)|^2\right]$. The PSD of the clean and the noisy speech can be represented by $\phi_X(k,l)$ and $\phi_Y(k,l)$, respectively. In the STFT domain, there exists a correlation between the neighboring T-F bins [7]. Therefore, the SPP estimate can be improved using the correlation.
The first step in creating our input signal vector is to obtain a vector corresponding to each individual frequency bin,
$$\boldsymbol{\phi}_Y(k) = \left[\phi_Y(k,0), \phi_Y(k,1), \dots, \phi_Y(k,L-1)\right]^T. \tag{3}$$
Each frequency bin vector contains $L$ consecutive time frames, which contain relevant contextual information for the estimation of the SPP. Since RNNs are effective at processing temporal information [19, 20], we employ RNNs in this work to extract time correlations from the neighboring time frames.
To improve the SPP estimation accuracy, we take a few neighboring frequency bin vectors into consideration to extract frequency correlations from the input signal matrix. Therefore, the input signal matrix can be obtained as
$$\boldsymbol{\Phi}_Y(k) = \left[\boldsymbol{\phi}_Y(k-I), \dots, \boldsymbol{\phi}_Y(k), \dots, \boldsymbol{\phi}_Y(k+I)\right]^T, \tag{4}$$
where $I$ is the number of neighboring frequency bin vectors.
Now, the time correlation and frequency correlation of neighboring time-frequency bins can be extracted according to the input signal matrix $\boldsymbol{\Phi}_Y(k)$. In this work, the SPP is represented by the a posteriori probability [5], and the DNN is used to estimate the SPP from the noisy observation.
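The construction of the per-bin input matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `bin_input_matrix` and the nested-list layout `psd[k][l]` are hypothetical, neighbors are taken on both sides of the center bin, and the treatment of edge bins (out-of-range neighbors are clamped to the nearest valid bin) is a choice made here because the text does not specify it.

```python
def bin_input_matrix(psd, k, num_neighbors):
    """Collect the frame vectors of bin k and its neighboring bins.

    psd is a list of K rows (one per frequency bin), each holding L frame
    values. The result has 2 * num_neighbors + 1 rows, ordered from bin
    k - num_neighbors up to bin k + num_neighbors.
    """
    num_bins = len(psd)
    rows = []
    for offset in range(-num_neighbors, num_neighbors + 1):
        # Assumption: clamp out-of-range neighbors to the nearest valid bin.
        neighbor = min(max(k + offset, 0), num_bins - 1)
        rows.append(list(psd[neighbor]))
    return rows


# Toy periodogram with K = 4 bins and L = 3 frames.
psd = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0],
       [10.0, 11.0, 12.0]]

center = bin_input_matrix(psd, k=1, num_neighbors=1)  # rows of bins 0, 1, 2
edge = bin_input_matrix(psd, k=0, num_neighbors=1)    # bin 0 repeated, then bin 1
```

Other edge conventions (zero padding, or skipping bins without a full neighborhood) would work equally well for the sketch; only the interior bins are unambiguous.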
Since the typical DNN-based approach takes all the frequency bins into account to estimate the SPP, the model complexity may be increased. In this section, we therefore design multiple specific DNNs to estimate the frequency bin-wise SPP. Additionally, since the a posteriori probability is derived from the correlation of neighboring T-F bins, the a posteriori probability SPP representation of the clean speech and the noisy speech PSD are used as the training data pairs to train our model.
2.2 SPP Estimation Model and Loss Function
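To extract the time and frequency correlation of the consecutive T-F bins in the input signal matrix from the observed noisy PSD $\phi_Y(k,l)$, we set $K$ specific DNNs as the regression module. As mentioned in (4), the coefficients of the $k$'th input signal matrix $\boldsymbol{\Phi}_Y(k)$ can be used to train the $k$'th DNN for the SPP estimate in the $k$'th frequency bin.
First, to train the DNN model, we choose the log-power periodogram as the input feature [21, 22]. Therefore, the input features of each individual DNN are obtained from the log input signal matrix. It can be expressed as
$$\tilde{\boldsymbol{\Phi}}_Y(k) = \log\left(\boldsymbol{\Phi}_Y(k)\right), \tag{5}$$
where $\tilde{\boldsymbol{\Phi}}_Y(k)$ is the input feature for the $k$'th DNN. Also, during training, we have
$$\widehat{\mathrm{SPP}}_Y(k) = f_{\theta_k}\left(\tilde{\boldsymbol{\Phi}}_Y(k)\right), \tag{6}$$
where $\widehat{\mathrm{SPP}}_Y(k)$ is the SPP estimate of the $k$'th input features, with elements $\widehat{\mathrm{SPP}}_Y(k,l)$, and $f_{\theta_k}$ is the $k$'th DNN with the parameter $\theta_k$. To update the DNN parameters, the loss between the target and the estimated SPP is calculated by mean-squared error (MSE), i.e.,
$$\mathcal{L}_k = \frac{1}{L} \sum_{l=0}^{L-1} \left(\mathrm{SPP}_Y(k,l) - \widehat{\mathrm{SPP}}_Y(k,l)\right)^2, \tag{7}$$
where $\mathrm{SPP}_Y(k,l)$ is the target function. In this work, the a posteriori probability is regarded as the SPP representation, therefore $\mathrm{SPP}_Y(k,l)$ can be represented, following [5], by
$$\mathrm{SPP}_Y(k,l) = \left(1 + \frac{P(H_0)}{P(H_1)} \left(1 + \xi_{H_1}\right) \exp\left(-\frac{|Y(k,l)|^2}{\phi_D(k,l)} \frac{\xi_{H_1}}{1 + \xi_{H_1}}\right)\right)^{-1}, \tag{8}$$
where $P(H_0)$ and $P(H_1)$ denote speech absence and presence probability, $\xi_{H_1}$ is the SNR during speech presence [5].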
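As a rough illustration of the training target and the loss, the sketch below evaluates the a posteriori SPP in the form published in [5] for a single T-F bin and compares a target sequence with an estimate using the MSE. The function names are hypothetical, and the default values (equal priors and a 15 dB SNR during speech presence) are the settings usually associated with [5]; they are assumptions here, since this section does not state the values used for training.

```python
import math


def a_posteriori_spp(noisy_power, noise_psd, xi_h1_db=15.0, p_h1=0.5):
    """A posteriori SPP of one T-F bin, in the form given in [5].

    noisy_power is |Y|^2 and noise_psd is the noise PSD of the same bin.
    The defaults are assumptions, not values taken from this paper.
    """
    xi = 10.0 ** (xi_h1_db / 10.0)       # SNR during speech presence (linear)
    prior_ratio = (1.0 - p_h1) / p_h1    # P(H0) / P(H1)
    exponent = (noisy_power / noise_psd) * xi / (1.0 + xi)
    return 1.0 / (1.0 + prior_ratio * (1.0 + xi) * math.exp(-exponent))


def mse(target, estimate):
    """Mean-squared error between two equally long SPP sequences."""
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


# Targets for three frames of one bin with unit noise PSD.
target = [a_posteriori_spp(power, 1.0) for power in (0.1, 2.0, 50.0)]
loss = mse(target, [0.5, 0.5, 0.5])
```

The target grows monotonically with the ratio of noisy power to noise PSD, so frames that rise well above the noise floor are pushed towards 1 and the remaining frames towards a small but non-zero value.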
2.3 Model Architecture
In this work, since a GRU can outperform an LSTM both in terms of convergence in CPU time, and in terms of parameter updates and generalization [23], we choose GRUs to design the SPP estimation model. The model training strategy is shown in Fig. 1 and the DNN model is trained by the input features of the logarithmic power spectral T-F bins.
The training strategy of the typical DNN-based SPP estimation model in Fig. 1(a) shows that a GRU module is trained using $K$ frequency bins (all frequency bins) and $L$ consecutive time frames. The typical DNN-based model input size is $K$ and, in this work, the size of the hidden layer is the same as the size of the input layer. The proposed training strategy of the frequency bin-wise SPP estimation model is shown in Fig. 1(b). When $I$ neighboring frequency bins are introduced to estimate the SPP of a single frequency bin, the input size is $2I+1$, and one hidden layer is set. The output of each hidden layer state is regarded as the value of the SPP estimate at the current time frame. Finally, to restrict the output range of the DNN to [0, 1], the output layer is an activation function with a fixed parameter.
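To make the per-bin architecture more concrete, the following pure-Python sketch runs a GRU with a single hidden unit over the frames of one bin and squashes each hidden state into [0, 1]. It is a sketch under several assumptions rather than the authors' model: the gate equations follow the common GRU formulation (the one used by PyTorch's `nn.GRU`), the hidden size of one is an assumption, the output activation is taken to be a logistic sigmoid with a hypothetical fixed slope `beta`, and the constant weights from `init_params` are placeholders for trained values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(input_size, value=0.1):
    """Placeholder weights: per gate, input weights, one recurrent weight, two biases."""
    params = {}
    for gate in "rzn":
        params["w_i" + gate] = [value] * input_size
        params["b_i" + gate] = 0.0
        params["w_h" + gate] = value
        params["b_h" + gate] = 0.0
    return params


def count_params(params):
    return sum(len(v) if isinstance(v, list) else 1 for v in params.values())


def gru_step(x, h, p):
    """One GRU step with a single hidden unit; x holds the features of one frame."""
    def from_input(gate):
        return sum(w * xi for w, xi in zip(p["w_i" + gate], x)) + p["b_i" + gate]

    r = sigmoid(from_input("r") + p["w_hr"] * h + p["b_hr"])          # reset gate
    z = sigmoid(from_input("z") + p["w_hz"] * h + p["b_hz"])          # update gate
    n = math.tanh(from_input("n") + r * (p["w_hn"] * h + p["b_hn"]))  # candidate state
    return (1.0 - z) * n + z * h


def estimate_spp(frames, p, beta=1.0):
    """Run the GRU over all frames and map each hidden state into [0, 1]."""
    h, out = 0.0, []
    for x in frames:
        h = gru_step(x, h, p)
        out.append(sigmoid(beta * h))  # assumed output activation
    return out


# One neighbor on each side gives three log-power features per frame.
params = init_params(input_size=3)
spp = estimate_spp([[0.0, 0.0, 0.0], [1.0, -1.0, 0.5], [2.0, 1.5, 1.0]], params)
```

With a hidden size of one, each per-bin network has only a handful of weights (`count_params` gives 12 for a single input feature), which is where the small model size of the bin-wise approach comes from.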
3 Experimental Settings
In this work, the sub-band DNS dataset is used to train our designed model. For testing, 200 noisy utterances (1.1 hours) and 1800 noisy utterances (1 hour) were collected from the DNS dataset [24] and the TIMIT dataset [25], respectively. Each clean utterance is corrupted by a random noise utterance selected from the noise dataset, with the SNR of each noisy utterance ranging from -5 dB to 25 dB. The noise data includes 150 different types of noise taken from the Audioset [26], Freesound [27], and Demand [28] datasets.
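Corrupting a clean utterance at a prescribed SNR amounts to rescaling the noise before adding it. The sketch below shows one common way to do this; the helper name `mix_at_snr`, the synthetic sine and Gaussian signals, and the uniform draw of the SNR are assumptions for illustration and do not describe how the DNS data were actually generated.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that clean + noise has the requested SNR in dB.

    Both inputs are equally long lists of samples. Returns the noisy
    signal and the scaled noise.
    """
    clean_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum(d * d for d in noise) / len(noise)
    gain = math.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * d for d in noise]
    return [s + d for s, d in zip(clean, scaled)], scaled


random.seed(0)
clean = [math.sin(0.05 * n) for n in range(1600)]       # stand-in for speech
noise = [random.gauss(0.0, 1.0) for _ in range(1600)]   # stand-in for noise
snr_db = random.uniform(-5.0, 25.0)                     # SNR range stated in the text
noisy, scaled_noise = mix_at_snr(clean, noise, snr_db)
```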
The receiver operating characteristic (ROC) [29] curve is used to evaluate the performance of the SPP estimation methods, and the false-alarm probability given in [7] is used to calculate the speech detection probability, $P_d$. Additionally, we apply the area under curve (AUC) metric, which is derived from the ROC and ranges from 0 to 1, to represent overall performance. We also adopt the adaptive threshold from [7], set to -60 dB relative to the maximum instantaneous power across all T-F bins, to distinguish the speech and non-speech bins across all T-F bins of clean speech.
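The evaluation described above can be sketched as follows: reference labels are obtained by thresholding the clean power relative to its maximum, the ROC is traced by sweeping a decision threshold over the estimated SPP, and the AUC and the detection probability at a chosen false-alarm probability are read from the resulting points. All function names are hypothetical, the threshold grid is an arbitrary choice, and the false-alarm value of 0.05 in the usage lines is purely illustrative.

```python
def speech_labels(clean_power, floor_db=-60.0):
    """Mark bins whose clean power exceeds the maximum power plus floor_db."""
    threshold = max(clean_power) * 10.0 ** (floor_db / 10.0)
    return [p > threshold for p in clean_power]


def roc_points(spp, is_speech, num_thresholds=101):
    """Sorted (false-alarm, detection) probability pairs over a threshold sweep."""
    num_pos = sum(1 for s in is_speech if s)
    num_neg = len(is_speech) - num_pos
    points = [(0.0, 0.0), (1.0, 1.0)]  # make sure both end points are present
    for i in range(num_thresholds):
        thr = i / (num_thresholds - 1)
        tp = sum(1 for p, s in zip(spp, is_speech) if p > thr and s)
        fp = sum(1 for p, s in zip(spp, is_speech) if p > thr and not s)
        points.append((fp / num_neg, tp / num_pos))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def detection_at(points, p_fa):
    """Largest detection probability whose false-alarm probability is at most p_fa."""
    return max(pd for fa, pd in points if fa <= p_fa)


clean_power = [1.0, 0.5, 1e-9, 0.0]   # toy clean-speech powers of four bins
spp = [0.9, 0.2, 0.8, 0.1]            # toy SPP estimates for the same bins
labels = speech_labels(clean_power)
points = roc_points(spp, labels)
area = auc(points)
p_d = detection_at(points, p_fa=0.05)  # illustrative operating point
```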
The sampling rate of all utterances is 16 kHz. A Hann window is applied for the STFT analysis, with a window length of 16 ms and a hop length of 8 ms. We use the mean and standard deviation to normalize the dataset. During training, the Adam optimizer [30] is utilized to optimize the neural network parameters. The learning rate is set to 0.001. Weight decay is set to 0.00001 to prevent overfitting. The parameter will be updated at the 50th and 100th epochs for the implemented DNN models. PyTorch is used to implement the frequency bin-wise SPP estimation model and the reference DNN-based model.
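The STFT settings translate into sample counts as shown in the short sketch below, which also includes a plain mean and standard deviation normalization. It assumes that the FFT size equals the window length (no zero padding) and that no padding is applied at the signal edges; the names and the use of the population standard deviation are choices made for the example.

```python
import statistics

SAMPLE_RATE_HZ = 16000
WINDOW_MS = 16
HOP_MS = 8

window_length = SAMPLE_RATE_HZ * WINDOW_MS // 1000  # samples per analysis window
hop_length = SAMPLE_RATE_HZ * HOP_MS // 1000        # samples between window starts
num_bins = window_length // 2 + 1                   # one-sided spectrum, FFT size = window


def num_frames(num_samples):
    """Number of full windows that fit into the signal (no edge padding assumed)."""
    if num_samples < window_length:
        return 0
    return 1 + (num_samples - window_length) // hop_length


def normalize(features):
    """Zero-mean, unit-variance normalization of a list of feature values."""
    mean = statistics.fmean(features)
    std = statistics.pstdev(features)
    return [(f - mean) / std for f in features]


frames_per_second = num_frames(SAMPLE_RATE_HZ)
normalized = normalize([-3.0, -1.0, 0.5, 2.0])
```

In practice the normalization statistics would be computed once on the training features and reused at test time; the sketch normalizes a single list only to keep the example short.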
4 Results and Discussion
In this section, to prove the effectiveness of our method, a comparison is shown between a typical DNN-based model and our proposed method using ROC curves. Moreover, some numerical results are provided to evaluate the accuracy of the SPP estimators and the model complexity, respectively.
4.1 Examination of ROC Curves
To investigate the performance of the proposed method, 200 training utterances (1.1 hours) are used to train our proposed frequency bin-wise model. In addition, 200 utterances (1.1 hours), 1000 utterances (5.5 hours), and 3000 utterances (16.6 hours) are used to train the typical DNN-based model, respectively. To investigate the effect of using neighboring frequency bins for the proposed method, we set the number of neighboring frequency bins to $I = 0$ (no neighboring frequency bins), $I = 1$ (with one neighboring frequency bin), and $I = 2$ (with two neighboring frequency bins) to train the frequency bin-wise model. Fig. 2 shows an example of SPP estimation results. A noisy utterance of length 20 seconds with an input SNR of 11 dB, taken from the DNS dataset, is used to test the typical DNN-based SPP estimation model and the frequency bin-wise model.
From Fig. 2, we can observe that the typical DNN-based method and the proposed frequency bin-wise method are able to estimate the SPP with similar accuracy. In addition, we also investigate the impact of the training data volume on SPP estimation accuracy for the typical DNN-based SPP estimation model. From Fig. 3, we can see that when the training data is increased from 1.1 hours to 5.5 hours and then to 16.6 hours for the typical DNN-based model, the AUC gradually increases, but the model still falls short of our proposed method in terms of the speech detection probability $P_d$.
4.2 Numerical Results
To evaluate the performance of the proposed method, the speech detection probability and the AUC are calculated from the ROC curves to represent the speech detection accuracy and the effectiveness of the SPP estimation method, respectively. In addition, we also investigate the effect of model complexity on SPP estimation accuracy. Inspired by [31] and [32], we compare our method with the state-of-the-art self-attention model and, in this work, 3 self-attention heads and 2 encoder layers are used to estimate the SPP. The self-attention model is trained in a typical way where all the frequency bins are treated as input features. During training, the frequency bin-wise SPP estimation model and the self-attention-based SPP estimation model are trained with 1.1 hours of training data pairs. The typical DNN-based model is trained with 1.1 and 16.6 hours of training data pairs, respectively. All training data pairs come from the DNS dataset.
In Table 1, we show how the proposed model compares to other conventional methods and a few DNN-based methods using the speech detection probability $P_d$ and the AUC as metrics. The results in Table 1 are obtained from testing using the TIMIT dataset (1 hour).
With 1.1 hours of training data, we can observe that although the AUC of the frequency bin-wise model without neighboring bins (0.7986) is lower than that of the typical DNN-based model and the self-attention-based model, it is still higher than that of IMCRA [4] (0.6504), Unbiased MMSE [5] (0.7348), and the General SPP estimator [7] (0.6229). In particular, when we use one and two neighboring frequency bins ($I = 1$ and $I = 2$), the frequency bin-wise model achieves higher AUCs of 0.8011 and 0.7988, respectively. In terms of speech detection accuracy, all the frequency bin-wise models achieve a higher speech detection probability than the other methods, and when we take one neighboring frequency bin ($I = 1$) into account the speech detection probability reaches 0.5038.
According to the results, we can confirm that an increase in model complexity can improve the performance of DNN-based applications, and in this work, the SPP estimation accuracy can also be improved, which is consistent with the experimental results shown in [33]. The reason is that the complex model can extract more global information than the simple model to estimate the SPP from all frequency bins. Additionally, a remarkable improvement in speech detection accuracy appears when we employ our proposed method to estimate the SPP; especially when we use one neighboring frequency bin ($I = 1$), both the AUC and the speech detection probability $P_d$ are improved. The reason for the improved performance could be that the DNNs can extract specific contextual information for each frequency bin, which is not possible without neighboring frequency bins ($I = 0$) because the neighbors are not included.
Finally, by comparing the AUC of different SPP estimation methods, we can observe that all DNN-based models can achieve higher performance of SPP estimation than the conventional methods. For DNN-based SPP estimation models, although all the presented models demonstrate similar performance, the speech detection accuracy is different. Therefore, it can be observed that more details can be detected by the bin-wise model leading to better detection accuracy.
4.3 Computational Complexity
To evaluate the complexity of the proposed model relative to its counterparts, we use the number of parameters and floating point operations (FLOPs) as the metrics. For our proposed frequency bin-wise model, the total parameters and FLOPs of all the models are used to represent computational complexity. We use the ptflops Python library (https://pypi.org/project/ptflops/) to calculate the total parameters and FLOPs for our method and the reference DNN-based methods. Table 2 shows that our proposed method has fewer parameters and FLOPs than the other methods. The reason is that although we use multiple DNNs to estimate the SPP, each DNN has a smaller input size than the typical DNN-based model. Furthermore, although we introduced the neighboring frequency bins to estimate the SPP in Section 4.2, from Table 2, we can also observe that the increase in computational complexity is minimal even with the inclusion of additional neighboring frequency bins.
From the above experimental results, we can confirm that although increasing the training data and using complex models can contribute to the improvement of the performance of the typical DNN-based SPP model, high computational complexity is inevitable. However, it can be observed that the proposed frequency bin-wise model not only shows an improvement in the speech detection probability $P_d$ while maintaining similar performance in terms of the AUC but also reduces the computational complexity while using the same amount of training data.
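The parameter counts can be approximated by hand. The sketch below uses the standard parameter count of a single-layer GRU with two bias vectors per gate (the layout used by PyTorch's `nn.GRU`) and assumes 129 frequency bins, as implied by a 256-sample window, and a hidden size of one for each per-bin network. Both are assumptions inferred for this example, not statements from the paper.

```python
def gru_param_count(input_size, hidden_size):
    """Parameters of a single-layer GRU: three gates, each with input weights,
    recurrent weights, and two bias vectors."""
    per_gate = (input_size * hidden_size
                + hidden_size * hidden_size
                + 2 * hidden_size)
    return 3 * per_gate


NUM_BINS = 129  # assumption: 256-sample window, one-sided spectrum

typical = gru_param_count(NUM_BINS, NUM_BINS)            # one GRU over all bins
bin_wise_no_neighbors = NUM_BINS * gru_param_count(1, 1)  # one small GRU per bin
bin_wise_three_inputs = NUM_BINS * gru_param_count(3, 1)  # three features per bin
```

Under these assumptions the first two counts are 100,620 and 1,548, which agrees with the 100.62K and 1548 reported for the typical DNN-based model and the proposed model without neighboring bins. With three input features per bin the formula gives 2,322, slightly above the 2,292 reported, so the handling of neighbors at the band edges evidently differs from this simple count.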
5 Conclusion
In this work, we proposed an effective frequency bin-wise SPP estimation method that shows good performance with a limited amount of training data while also maintaining low model complexity. Experimental results show that in addition to reducing the model complexity, the frequency bin-wise model also shows better performance even in comparison with the typical DNN-based model that is trained with increasing amounts of training data. The experimental observations involving the inclusion of neighboring frequency bins show that there is an increase in speech detection accuracy as well as the AUC (compared to its counterpart that does not include any neighboring frequency bins) due to being exposed to local contextual information. Since multiple DNNs are employed to estimate the SPP in the STFT domain, the frequency bin-wise model’s computational complexity is much lower than its DNN-based counterparts.
References
- [1] M. Kim and J. W. Shin, “Improved speech enhancement considering speech PSD uncertainty,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1939–1951, 2022.
- [2] S. K. Roy and K. K. Paliwal, “Robustness and sensitivity metrics-based tuning of the augmented Kalman filter for single-channel speech enhancement,” Applied Acoustics, vol. 185, p. 108355, 2022.
- [3] Y. Zhao, J. K. Nielsen, J. Chen, and M. G. Christensen, “Model-based distributed node clustering and multi-speaker speech presence probability estimation in wireless acoustic sensor networks,” The Journal of the Acoustical Society of America, vol. 147, no. 6, pp. 4189–4201, 2020.
- [4] I. Cohen, “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 466–475, 2003.
- [5] T. Gerkmann and R. C. Hendriks, “Unbiased MMSE-based noise power estimation with low complexity and low tracking delay,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1383–1393, 2011.
- [6] M. Souden, J. Chen, J. Benesty, and S. Affes, “An integrated solution for online multichannel noise tracking and reduction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2159–2169, 2011.
- [7] H. Momeni, E. A. Habets, and H. R. Abutalebi, “Single-channel speech presence probability estimation using inter-frame and inter-band correlations,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 2903–2907.
- [8] M. Souden, J. Chen, J. Benesty, and S. Affes, “Gaussian model-based multichannel speech presence probability,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 1072–1077, 2009.
