Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement
Bunlong Lay, Simon Welker, Julius Richter, Timo Gerkmann

TL;DR
This paper introduces a Brownian bridge-based forward process for diffusion models in speech enhancement, reducing prior mismatch and improving objective metrics with fewer steps and hyperparameters.
Contribution
It proposes a novel Brownian bridge-based forward process that reduces prior mismatch in diffusion models for speech enhancement, leading to better performance with fewer steps.
Findings
Reduces prior mismatch compared to previous diffusion processes.
Improves objective metrics over baseline with half the iteration steps.
Simplifies hyperparameter tuning.
| SDE | POLQA | PESQ | ESTOI | SI-SDR [dB] | SI-SIR [dB] | SI-SAR [dB] |
|---|---|---|---|---|---|---|
| Mixture | - | | | | | |
| Baseline OUVE [8] | | | | | | |
| BBED | | | | | | |
| BBED | | | | | | |
Acknowledgments: We acknowledge the support by DASHH (Data Science in Hamburg - HELMHOLTZ Graduate School for the Structure of Matter) with the Grant-No. HIDSS-0002 and the German Research Foundation (DFG) in the transregio project Crossmodal Learning (TRR 169).
Abstract
Recently, score-based generative models have been successfully employed for the task of speech enhancement. A stochastic differential equation is used to model the iterative forward process, where at each step environmental noise and white Gaussian noise are added to the clean speech signal. While in the limit the mean of the forward process ends at the noisy mixture, in practice it stops earlier and thus only at an approximation of the noisy mixture. This results in a discrepancy between the terminating distribution of the forward process and the prior used for solving the reverse process at inference. In this paper, we address this discrepancy and propose a forward process based on a Brownian bridge. We show that such a process leads to a reduction of the mismatch compared to previous diffusion processes. More importantly, we show that our approach improves in objective metrics over the baseline process with only half of the iteration steps and with one fewer hyperparameter to tune.
Index Terms: speech enhancement, diffusion models, stochastic differential equations, Brownian bridge.
1 Introduction
Speech enhancement aims to recover the clean speech signal from a noisy mixture that is corrupted by environmental noise [1]. Classical approaches try to exploit statistical relations of the clean speech signal and the environmental noise [2]. Numerous machine learning methods have been proposed that treat speech enhancement as a discriminative learning task [3, 4].
Different from discriminative approaches that learn a direct mapping from noisy to clean speech, generative approaches learn a prior distribution over clean speech data. Recently, so-called score-based generative models (or diffusion models) were introduced to the task of speech enhancement [5, 6, 7, 8, 9]. The idea is to iteratively add Gaussian noise to the data using a discrete and fixed Markov chain called the forward process, thereby transforming data into a tractable distribution such as a normal distribution. Then, a neural network is trained to invert this diffusion process in a so-called reverse process [10]. When the step size between two discrete Markov chain states is taken to zero, the discrete Markov chain becomes a continuous-time stochastic differential equation (SDE) under mild constraints. Utilizing SDEs offers more flexibility and opportunities than approaches based on discrete Markov chains [11]. For example, SDEs allow the use of general-purpose SDE solvers to numerically integrate the reverse process, which affects both the performance and the number of iteration steps. An SDE can be interpreted as a transformation between two given distributions, where one is called the initial distribution and the other the terminating distribution. In the case of speech enhancement, we transform between the distribution of clean speech data and the distribution of noisy mixture data. Under mild constraints, we can find for each forward SDE a reverse SDE inverting the forward SDE [12, 13]. This reverse SDE starts from a noisy mixture and ends at the clean speech. It can therefore be used for speech enhancement.
Currently, for the task of speech enhancement, there are different approaches that integrate the corruption of environmental noise in the diffusion process [6, 7, 8]. To compensate for non-Gaussian noise characteristics, these approaches use an interpolation between clean speech and noisy speech data along the forward process. In [7, 8] a continuous-time SDE is used, which includes a drift term that allows the transformation between clean and noisy speech. Interestingly, the mean of the process in [7, 8] evolves from clean speech perfectly to noisy speech only for an infinitely long forward diffusion process. In practice, however, the mean of the forward process ends at an approximation of the noisy speech data. Therefore, when solving the reverse SDE to perform speech enhancement, there exists a mismatch between the terminating distribution of the forward process and the initial distribution of the reverse process [8]. We call the initial distribution of the reverse process the prior distribution of the generative model and the corresponding mismatch the prior mismatch. Moreover, the SDEs in [7, 8, 14, 15] include a stiffness parameter controlling the pull of the terminating distribution of the forward process and the prior distribution. Consequently, this stiffness parameter determines the degree of the resulting prior mismatch. Increasing the stiffness reduces the prior mismatch, but may also negatively affect the speech enhancement performance as the reverse process may become unstable [8, Section II D].
To overcome this limitation, we seek to reduce the prior mismatch without destabilizing the reverse process. To this end, we propose to replace the forward process in [7, 8] with an SDE based on a Brownian bridge process. A Brownian bridge seems suitable for this purpose because it has fixed starting and end points and follows a Brownian motion in between. We show that the resulting diffusion process not only drastically decreases the prior mismatch, but also eliminates the dataset-dependent and hard-to-tune stiffness parameter of the SDE in [7, 8]. In the experiments, we demonstrate that the proposed SDE outperforms the baseline SDE while having one hyperparameter less to tune and using only half as many function evaluations. The code is available online at https://github.com/sp-uhh/sgmse-bbed.
2 Background
The task of speech enhancement is to estimate the clean speech signal $\mathbf{s}$ from a noisy mixture $\mathbf{y} = \mathbf{s} + \mathbf{n}$, where $\mathbf{n}$ is environmental noise. All variables in bold are the coefficients of a complex-valued short-time Fourier transform (STFT), e.g. $\mathbf{y} \in \mathbb{C}^{d}$ and $d = KF$ with $K$ the number of STFT frames and $F$ the number of frequency bins.
2.1 Stochastic Differential Equations
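Following the approach in [7, 8], we model the forward process of the score-based generative model with an SDE defined on $0 \le t < T_{\max}$:
$$\mathrm{d}\mathbf{x}_t = f(\mathbf{x}_t, \mathbf{y})\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}, \tag{1}$$
where $\mathbf{w}$ is the standard Wiener process [16], $\mathbf{x}_t$ is the current process state with initial condition $\mathbf{x}_0 = \mathbf{s}$, and $t$ is a continuous diffusion time-step variable describing the progress of the process ending at the last diffusion time-step $T_{\max}$. Moreover, $f(\mathbf{x}_t, \mathbf{y})\,\mathrm{d}t$ can be integrated by Lebesgue integration [17], and $g(t)\,\mathrm{d}\mathbf{w}$ follows Ito integration [16]. The functions $f$ and $g$ are called drift and diffusion coefficient, respectively. The diffusion coefficient $g$ regulates the amount of Gaussian noise that is added to the process, and the drift $f$ mainly affects, in the case of linear SDEs, the mean of $\mathbf{x}_t$ (see [16, (6.10)]). The process state $\mathbf{x}_t$ follows a Gaussian distribution [18, Ch. 5], called the perturbation kernel:
$$p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0, \mathbf{y}) = \mathcal{N}_{\mathbb{C}}\left(\mathbf{x}_t;\ \boldsymbol{\mu}(\mathbf{x}_0, \mathbf{y}, t),\ \sigma(t)^2 \mathbf{I}\right). \tag{2}$$
By Anderson [12], each forward SDE as in (1) can be associated to a reverse SDE:
$$\mathrm{d}\mathbf{x}_t = \left[-f(\mathbf{x}_t, \mathbf{y}) + g(t)^2 \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t \mid \mathbf{y})\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}, \tag{3}$$
where $\bar{\mathbf{w}}$ is a Wiener process going backwards in time. In particular, the reverse process starts at $t = T$ and ends at $t = 0$. Here $T < T_{\max}$ is a parameter that needs to be set for practical reasons as the last diffusion time-step $T_{\max}$ is only reached in the limit. The score function $\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t \mid \mathbf{y})$ is approximated by a neural network called score model $\mathbf{s}_\theta(\mathbf{x}_t, \mathbf{y}, t)$, which is parameterized by a set of parameters $\theta$. Assuming that $\mathbf{s}_\theta$ is available, we can generate an estimate of the clean speech $\hat{\mathbf{x}}_0$ from $\mathbf{y}$ by solving the reverse SDE.
The prior mismatch discussed in this paper is defined by the difference of $\boldsymbol{\mu}(\mathbf{x}_0, \mathbf{y}, T)$ to $\mathbf{y}$. In this work and previous work [7, 8], we consider only SDEs where the mean is of the form
$$\boldsymbol{\mu}(\mathbf{x}_0, \mathbf{y}, t) = (1 - \lambda(t))\,\mathbf{x}_0 + \lambda(t)\,\mathbf{y}, \tag{4}$$
where $\lambda(t)$ is an increasing function. In the sequel, we will simply write $\boldsymbol{\mu}(t)$ for brevity. The mismatch of such an SDE is determined by $\lambda(T)$, and we call $\lambda(T)$ the maximal interpolation factor (MIF) for the rest of the paper. It is desired that the MIF is close to 1 and we will see in the following sections to which degree this goal is met.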
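As a minimal sketch of how a forward SDE with a drift and a diffusion coefficient can be simulated, the following uses the Euler-Maruyama scheme on a single real-valued coefficient. The function name `euler_maruyama_forward`, the scalar setting, and all parameter values are illustrative assumptions; the paper operates on complex STFT coefficients and does not prescribe this code. The drift callback also receives `t` so that time-dependent drifts can be passed in.

```python
import math
import random


def euler_maruyama_forward(x0, y, drift, diffusion, t_end, n_steps, rng=None):
    """Simulate dx = drift(x, y, t) dt + diffusion(t) dw from t = 0 to t_end."""
    rng = rng or random.Random(0)
    dt = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment with variance dt
        x = x + drift(x, y, t) * dt + diffusion(t) * dw
        t += dt
    return x


# Hypothetical example: a drift pulling x towards y, plus a small constant diffusion.
x_end = euler_maruyama_forward(
    x0=0.0,
    y=1.0,
    drift=lambda x, y, t: 1.0 * (y - x),
    diffusion=lambda t: 0.1,
    t_end=1.0,
    n_steps=1000,
)
```

With the diffusion set to zero, the simulated state follows the mean of the process, which makes the role of the drift easy to inspect in isolation.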
3 Design choices of different SDEs
3.1 Ornstein-Uhlenbeck with Variance Exploding (OUVE)
In [7, 8] an SDE is used with the drift coefficient and diffusion coefficient defined as
$$f(\mathbf{x}_t, \mathbf{y}) = \gamma\,(\mathbf{y} - \mathbf{x}_t), \tag{5}$$
$$g(t) = \sigma_{\min}\left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)^{t} \sqrt{2 \log\left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)}, \tag{6}$$
for $t \in [0, T]$ and parameters $\gamma$, $\sigma_{\min}$, and $\sigma_{\max}$. Such a drift term is typical for an Ornstein-Uhlenbeck process [16], whereas the diffusion coefficient is taken from the so-called Variance Exploding SDE [11]. Thus, we call the baseline SDE Ornstein-Uhlenbeck with Variance Exploding (OUVE). A reparameterization of (6) with $c = \sigma_{\min}\sqrt{2\log(\sigma_{\max}/\sigma_{\min})}$ and $k = \sigma_{\max}/\sigma_{\min}$ yields
$$g(t) = c\,k^{t}. \tag{7}$$
We argue that this equivalent representation of the diffusion coefficient may increase the intuition of (6), as $c$ simply scales the diffusion coefficient and $k$ is the base of the exponential term. We will simply use the parameterization of Eq. (7) for the rest of this work.
The closed-form solutions for the mean and variance of the perturbation kernel of this SDE are given by:
$$\boldsymbol{\mu}(\mathbf{x}_0, \mathbf{y}, t) = e^{-\gamma t}\,\mathbf{x}_0 + \left(1 - e^{-\gamma t}\right)\mathbf{y} \tag{8}$$
and
$$\sigma(t)^2 = \frac{c^2\left(k^{2t} - e^{-2\gamma t}\right)}{2\,(\gamma + \log k)}. \tag{9}$$
We see from (8) that for large $t$, $\mathbf{x}_t$ has mean $\mathbf{y}$. However, as in practice we need to decide for a finite final diffusion time-step $T$, a certain difference between the mean of $\mathbf{x}_T$ and $\mathbf{y}$ remains. If we parameterize the OUVE SDE as in [8], i.e. and , then we find that the MIF is . As it is desired to have a MIF close to 1, we argue that the difference between the mean of $\mathbf{x}_T$ and $\mathbf{y}$ is relatively large. Note that increasing $T$ for fixed $\gamma$ to obtain a better MIF is equivalent to fixing $T$ and increasing $\gamma$. Moreover, increasing $\gamma$ yields a better MIF, but also worsens the performance of this approach, as the sampling from the reverse SDE becomes unstable [8, Section II D]. Therefore, increasing the MIF for this SDE is not straightforward.
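The dependence of the OUVE mismatch on the stiffness and the final time can be made concrete with a short sketch. It assumes the standard closed forms for a linear SDE with drift $\gamma(\mathbf{y} - \mathbf{x}_t)$ and diffusion $c\,k^t$: mean $e^{-\gamma t}\mathbf{x}_0 + (1 - e^{-\gamma t})\mathbf{y}$ and variance $c^2 (k^{2t} - e^{-2\gamma t}) / (2(\gamma + \log k))$. The function names are hypothetical, and any values passed in are illustrative rather than the settings used in the paper or in [8].

```python
import math


def ouve_mean(x0, y, t, gamma):
    """Mean of the OUVE perturbation kernel at diffusion time t."""
    w = math.exp(-gamma * t)
    return w * x0 + (1.0 - w) * y


def ouve_variance(t, gamma, c, k):
    """Variance of the OUVE perturbation kernel for g(t) = c * k**t."""
    return c**2 * (k ** (2 * t) - math.exp(-2 * gamma * t)) / (2 * (gamma + math.log(k)))


def ouve_mif(gamma, T):
    """Maximal interpolation factor: the weight of y in the mean at t = T."""
    return 1.0 - math.exp(-gamma * T)
```

Because the MIF depends only on the product of `gamma` and `T`, doubling either one gives the same value, which is the equivalence noted in the text.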
3.2 Brownian Bridge with Exponential Diffusion Coefficient (BBED)
In order to reduce the prior mismatch, we propose to employ an SDE that has a linear interpolation factor $\lambda(t) = t$, where $0 \le t \le T < 1$. Substituting $\lambda(t) = t$, the mean of the SDE in (4) becomes
$$\boldsymbol{\mu}(\mathbf{x}_0, \mathbf{y}, t) = (1 - t)\,\mathbf{x}_0 + t\,\mathbf{y}. \tag{10}$$
One can find an SDE with the following drift coefficient that has the desired mean from (10) by solving [16, (6.12)]
$$f(\mathbf{x}_t, \mathbf{y}) = \frac{\mathbf{y} - \mathbf{x}_t}{1 - t}. \tag{11}$$
Comparing (10) and (4), we see that the MIF is $T$. Note that the choice of $T$ is limited due to numerical stability as we divide by $1 - t$ in (11). However, it is still possible to achieve a much better MIF compared to the MIF of the OUVE SDE, as we will see in Section 5.2.
For a fair comparison to the OUVE SDE, we want to utilize the same diffusion coefficient as the OUVE SDE in Eq. (7). The resulting variance can be computed from [16, (6.11)]:
$$\sigma(t)^2 = (1 - t)\,c^2 \left[\left(k^{2t} - 1 + t\right) + \log\left(k^{2k^2}\right)(1 - t)\,E\right], \quad E = \operatorname{Ei}\left[2(t - 1)\log k\right] - \operatorname{Ei}\left[-2\log k\right], \tag{12}$$
where $\operatorname{Ei}[\cdot]$ denotes the exponential integral function [19]. The variance trajectory exhibits one peak and vanishes for $t = 0$ and $t = 1$. The position of the peak is solely determined by $k$, where larger $k$ shifts the peak closer to $t = 1$.
In the literature, SDEs that linearly transform the starting condition ($\mathbf{x}_0$ and zero variance for $t = 0$) to the terminal condition ($\mathbf{y}$ and zero variance for $t = 1$) with a constant diffusion coefficient are called Brownian bridges [16]. As the SDE with drift coefficient (11) and diffusion coefficient (7) differs from that definition only in the diffusion coefficient, we call the SDE a Brownian Bridge with Exponential Diffusion coefficient (BBED).
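The shape of the BBED variance can be checked numerically without the exponential integral. The sketch below assumes the drift $(\mathbf{y} - \mathbf{x}_t)/(1 - t)$ and the diffusion $c\,k^t$, for which the variance of a linear SDE can be written as $\sigma(t)^2 = (1 - t)^2 \int_0^t c^2 k^{2s} / (1 - s)^2 \,\mathrm{d}s$, and evaluates that integral with the composite Simpson rule. The function names, the grid resolution, and the values of `c` and `k` are illustrative assumptions, not the paper's implementation.

```python
import math


def bbed_variance(t, c, k, n=1000):
    """Variance of x_t for drift (y - x)/(1 - t) and diffusion c * k**t, 0 <= t < 1.

    Evaluates (1 - t)^2 * integral_0^t (c * k**s)^2 / (1 - s)^2 ds
    with the composite Simpson rule (n must be even).
    """
    if t <= 0.0:
        return 0.0
    h = t / n

    def integrand(s):
        return (c * k**s) ** 2 / (1.0 - s) ** 2

    total = integrand(0.0) + integrand(t)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return (1.0 - t) ** 2 * total * h / 3.0


def peak_position(c, k, grid=200):
    """Grid search for the diffusion time at which the variance is largest."""
    ts = [i / grid for i in range(1, grid)]
    return max(ts, key=lambda t: bbed_variance(t, c, k))
```

For `k = 1` the diffusion coefficient is constant and the variance reduces to `c**2 * t * (1 - t)`, the variance of a scaled Brownian bridge, with its peak at `t = 0.5`; larger `k` moves the peak towards `t = 1`, while `c` only rescales the curve.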
4 Experimental setup
To allow a fair comparison between the BBED SDE and the OUVE SDE, we train the corresponding score models with the same configuration and follow the experimental setup from [8, Section V].
4.1 Training
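For the score model $\mathbf{s}_\theta$, we employ the Noise Conditional Score Network (NCSN++) architecture (see [8, 11] for more details). The network is optimized based on denoising score matching:
$$\arg\min_{\theta}\ \mathbb{E}_{t, (\mathbf{x}_0, \mathbf{y}), \mathbf{z}, \mathbf{x}_t \mid (\mathbf{x}_0, \mathbf{y})} \left[\left\| \mathbf{s}_\theta(\mathbf{x}_t, \mathbf{y}, t) + \frac{\mathbf{z}}{\sigma(t)} \right\|_2^2\right], \tag{13}$$
where $\mathbf{x}_t = \boldsymbol{\mu}(\mathbf{x}_0, \mathbf{y}, t) + \sigma(t)\,\mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{I})$. We train the network with the ADAM optimizer [20] with a learning rate of and a batch size of 16. An exponential moving average of network parameters is tracked with a decay of 0.999, to be used for sampling [7, 11]. We train for 250 epochs and log the averaged PESQ value of 10 random files from the validation set during training and select the best-performing model for evaluation. Experiments are conducted on an NVIDIA A6000 and training lasts for approximately 4 days.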
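A denoising score matching objective of this kind can be sketched for a single training pair as follows. The sketch assumes the common form in which a sample $\mathbf{x}_t = \boldsymbol{\mu} + \sigma(t)\,\mathbf{z}$ is drawn from the perturbation kernel and the score model is regressed onto $-\mathbf{z}/\sigma(t)$. The names `dsm_loss`, `mean_fn`, and `sigma_fn` are hypothetical, and plain Python lists of complex numbers stand in for STFT tensors; the actual training code uses a neural network and batched tensors.

```python
import math
import random


def complex_normal(rng):
    """Draw one sample from a standard circular complex Gaussian."""
    return complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) / math.sqrt(2.0)


def dsm_loss(score_model, x0, y, t, mean_fn, sigma_fn, rng):
    """Denoising score matching loss for one (x0, y) pair at diffusion time t.

    x0 and y are equally long lists of complex STFT coefficients.
    """
    sigma = sigma_fn(t)
    z = [complex_normal(rng) for _ in x0]
    mu = mean_fn(x0, y, t)
    # Sample from the Gaussian perturbation kernel around the mean.
    x_t = [m + sigma * zi for m, zi in zip(mu, z)]
    score = score_model(x_t, y, t)
    # Squared error between the predicted score and the target -z / sigma.
    return sum(abs(s + zi / sigma) ** 2 for s, zi in zip(score, z))
```

In training, `t` would be drawn at random for every example and the loss averaged over a batch; a score model that returns the exact score of the perturbation kernel drives this loss to zero.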
4.2 Dataset and input representation
We use the same WSJ0-CHiME3 dataset as in [8]. This dataset mixes clean speech utterances from the Wall Street Journal (WSJ0) dataset [21] with noise signals from the CHiME3 dataset [22] at a uniformly sampled signal-to-noise ratio (SNR) between 0 and 20 dB. The dataset is split into a train (12777 files), validation (1206 files) and test set (615 files).
Each file from the WSJ0-CHiME3 dataset is converted into a complex STFT representation with a window size of 510, resulting in 256 frequency bins, a hop size of 128 and a periodic Hann window. We randomly crop the STFT representation to a length of 256 frames at each training step. To compensate for the typically heavy-tailed distribution of STFT speech magnitudes [23], as in [8], each complex coefficient of the STFT representation is transformed via with and .
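The amplitude transformation can be illustrated on a single complex coefficient. The sketch assumes the compression used in [8], $\tilde{c} = \beta\,|c|^{\alpha} e^{i \angle(c)}$, which compresses the magnitude and leaves the phase unchanged; this form is taken as an assumption here, and the function names as well as any values chosen for `alpha` and `beta` are illustrative only.

```python
import cmath


def compress(c, alpha, beta):
    """Apply beta * |c|**alpha * exp(i * angle(c)) to one complex STFT coefficient."""
    magnitude, phase = cmath.polar(c)
    return cmath.rect(beta * magnitude**alpha, phase)


def expand(c_tilde, alpha, beta):
    """Invert compress(), e.g. before applying the inverse STFT."""
    magnitude, phase = cmath.polar(c_tilde)
    return cmath.rect((magnitude / beta) ** (1.0 / alpha), phase)
```

With `alpha < 1` large magnitudes are reduced more strongly than small ones, which counteracts the heavy-tailed magnitude distribution, and `beta` rescales the result.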
4.3 Sampling and metrics
For the baseline OUVE SDE and the proposed BBED SDE we use the same sampler settings for a fair comparison. We use a Predictor-Corrector scheme as in [8, 11], where the Predictor is the Euler-Maruyama method [18] and the Corrector is the Annealed Langevin Dynamics (ALD) method [11]. As in [8], the step size for ALD is chosen as and the number of reverse steps is . Equivalently, the step size in the reverse process is , where is set for the OUVE SDE and BBED SDE individually (see Section 4.4). For the reverse process, we set the reverse starting time at . We also report results when experimenting with the reverse starting times in Section 5.3 while keeping the step size fixed.
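We evaluate the performance on the perceptual metrics wideband Perceptual Evaluation of Speech Quality (PESQ) [24] and Perceptual Objective Listening Quality Analysis (POLQA) [25], on the energy-based metrics SI-SDR, SI-SIR and SI-SAR [26], and on the intelligibility metric Extended Short-Term Objective Intelligibility (ESTOI) [27].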
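Of the energy-based metrics, SI-SDR is the simplest to state in code. The sketch below follows the usual scale-invariant definition: the reference is rescaled by its projection coefficient onto the estimate, and the energy of that scaled target is compared with the energy of the residual. It operates on plain lists of real time-domain samples; the function name is hypothetical, and this is not the evaluation code used for the reported results.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB for real time-domain signals."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy  # optimal scaling of the reference
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual)
    return 10.0 * math.log10(target_energy / residual_energy)
```

Rescaling the estimate by any nonzero constant leaves the value unchanged, which is the property the "SI" prefix refers to.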
4.4 OUVE and BBED
As in [8], the parameters , , and were already tuned by a grid search. Therefore, we set as in [8] and , and the diffusion coefficient parameters in Eq. 6 are set to and , or in the equivalent representation in Eq. 7, we set and .
For the BBED SDE, we search for the largest in so that training and inference are numerically stable. The parameter $k$ in (7) is determined as the empirically optimal choice of . The values of the grid have been chosen in such a way that the resulting variances have their peaks ranging from to . For example, the resulting variance for has its maximum at , the variance for has its maximum at , etc. For each $k$, we set the normalization factor $c$ so that the variances admit a maximum value of either 0.15 or 0.3. This choice is based on the OUVE SDE parameterization also having a maximum value of . As an example, we plot two parameterizations of the variance of the BBED SDE in Fig. 1.
5 Results
First, we present the results when parameterizing the BBED SDE as described in Section 4.4. Second, we discuss if the proposed BBED SDE reduces the prior mismatch compared to the baseline OUVE SDE. Last, we discuss the performance differences in terms of objective metrics, number of iterations in the reverse process and subjective differences of the OUVE SDE and BBED SDE.
5.1 Parameterization of the BBED SDE
When training and testing the score model with the BBED SDE with different $k$, we argue that it is beneficial to have the variance maximum towards the end of the forward process, as the Gaussian noise would better mask the speech features corrupted by the environmental noise. At the same time, if the variance maximum is too close to the end of the forward process, which is at , then the diffusion coefficient becomes numerically large and consequently the reverse process may become unstable. Empirically, we found that with maximum variance results in the best performance (see Fig. 1 black line). When training and testing the score model with the BBED SDE with different , we found that is the largest value that causes no numerical issues.
5.2 Reducing the prior mismatch
As we set for the BBED SDE , we have that the MIF is . This is much closer to 1 than the MIF of achieved by the OUVE SDE as discussed in Section 3.1. We illustrate this prior mismatch in terms of SNR in Fig. 2. To this end, let $s$ and $y$ be time-domain signals, where $s$ is the clean speech signal and $y$ is any clean speech signal corrupted with environmental noise. We define SNR($s$, $y$) to be $10 \log_{10}\left(\|s\|^2 / \|y - s\|^2\right)$, where $\|\cdot\|$ denotes the Euclidean norm. In Fig. 2, we averaged
[TABLE]
for the BBED SDE and OUVE SDE over the WSJ0-CHiME3 test set. We have that approaches if . This is the case for the BBED SDE as it can be observed in Fig. 2. In comparison, we find that the OUVE SDE differs to the by dB at in Fig. 2, showing that the BBED SDE indeed reduces the prior mismatch compared to the OUVE SDE.
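The link between the interpolation weight and the SNR gap can be sketched with a few lines of code. The sketch assumes the definition $\mathrm{SNR}(s, y) = 10 \log_{10}(\|s\|^2 / \|y - s\|^2)$ and, as a simplification, applies the interpolation $(1 - w)\,s + w\,y$ directly to time-domain samples, whereas the forward process of the paper acts on transformed STFT coefficients. The function names and signal values are illustrative.

```python
import math


def snr_db(s, y):
    """SNR(s, y) = 10 log10(||s||^2 / ||y - s||^2) for time-domain signals."""
    signal_energy = sum(v * v for v in s)
    noise_energy = sum((b - a) ** 2 for a, b in zip(s, y))
    return 10.0 * math.log10(signal_energy / noise_energy)


def interpolate(s, y, weight):
    """Mean of a forward process that has moved a fraction `weight` from s to y."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(s, y)]
```

Under this simplification the interpolated signal differs from `s` by `weight * (y - s)`, so its SNR exceeds that of the mixture by `-20 * log10(weight)` dB, and the gap closes only as the weight approaches 1.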
5.3 OUVE vs. BBED
In Tab. 1 we show that BBED outperforms OUVE in all reported metrics. When listening to enhanced files generated by BBED and OUVE, we observe that the enhanced files generated by BBED contain less background noise and fewer breathing artifacts than the enhanced files generated by OUVE. We provide some listening examples in the supplementary material (https://www.inf.uni-hamburg.de/en/inst/ab/sp/publications/sgmse-bbed).
Remarkably, when experimenting with we found that the BBED SDE largely maintains performance when changing to as it can be seen in Tab. 1. This is in contrast to the OUVE SDE, which loses in PESQ when we set . Since we keep the reverse step size fixed when starting inference at , the number of iterations is halved with only negligible performance loss for the BBED SDE as compared to when starting the reverse process at . In particular, BBED even outperforms OUVE when only using half as many iterations for enhancement.
The proposed BBED SDE has a different drift coefficient compared to the OUVE SDE (compare Eq. (11) and Eq. (5)), which results in different mean evolutions (see Fig. 2) and different variance evolutions (see Fig. 1). Thus, there could be various reasons why the BBED SDE outperforms the OUVE SDE. We hypothesize that the much higher variance of the BBED SDE could be mainly responsible for the improvements, as a higher variance potentially helps to generate better speech estimates. We also believe that too large values for the diffusion coefficient may lead to numerical instability of the reverse process. We leave this discussion for future work.
6 Conclusions
In this paper, we aimed to minimize the prior mismatch in score-based generative modeling for speech enhancement. To this end, we constructed the BBED SDE that is inspired by Brownian bridges. The BBED SDE yields a much smaller prior mismatch compared to the baseline OUVE SDE and has one hyperparameter less to tune. As a result, we consistently improve in all reported metrics over the OUVE SDE. Moreover, the BBED SDE achieves improvements of in PESQ and in POLQA even when only using half as many function evaluations as the OUVE SDE.
References
- [1] R. C. Hendriks, T. Gerkmann, and J. Jensen, DFT-domain based single-microphone noise reduction for speech enhancement: A survey of the state-of-the-art. Morgan & Claypool, 2013.
- [2] T. Gerkmann and E. Vincent, "Spectral masking and filtering," in Audio Source Separation and Speech Enhancement, E. Vincent, T. Virtanen, and S. Gannot, Eds. John Wiley & Sons, 2018.
- [3] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE Trans. on Audio, Speech, and Language Proc. (TASLP), vol. 26, no. 10, pp. 1702–1726, 2018.
- [4] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE Trans. on Audio, Speech, and Language Proc. (TASLP), vol. 27, no. 8, pp. 1256–1266, 2019.
- [5] Y.-J. Lu, Y. Tsao, and S. Watanabe, "A study on speech enhancement based on diffusion probabilistic model," IEEE Asia-Pacific Signal and Inf. Proc. Assoc. Annual Summit and Conf. (APSIPA ASC), pp. 659–666, 2021.
- [6] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, "Conditional diffusion probabilistic model for speech enhancement," IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2022.
- [7] S. Welker, J. Richter, and T. Gerkmann, "Speech enhancement with score-based generative models in the complex STFT domain," Interspeech, 2022.
- [8] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, "Speech enhancement and dereverberation with diffusion-based generative models," IEEE Trans. on Audio, Speech, and Language Proc. (TASLP), 2023. [Online]. Available: https://arxiv.org/abs/2208.05830
