Mixed penalization in convolutive nonnegative matrix factorization for blind speech dereverberation

Francisco J. Ibarrola; Leandro E. Di Persia; Ruben D. Spies

arXiv:1706.00114·cs.SD·June 2, 2017

TL;DR

This paper introduces a novel convolutive nonnegative matrix factorization approach with mixed penalization to improve blind speech dereverberation, demonstrating significant performance gains over existing methods.

Contribution

It proposes a new method combining two penalizers in convolutive NMF for better speech dereverberation in general conditions.

Findings

01. Significant improvement over state-of-the-art dereverberation methods.
02. Effective in restoring speech signals affected by room reverberation.
03. Algorithm demonstrates practical applicability in real-world scenarios.

Tables

Table 1: Model parameter values

$p$	$N_{h}$	window	window size	window overlap	$\delta$	max. iter.
1	15	Hann	512 samples	256 samples	$\|Y\|_{F}\times 10^{-3}$	20

Table 2: Mean and (standard deviation) of fwsSNR for each method and reverberation time (best results in boldface).

Rev. time [ms]	Reverberant signal	Kameoka et al. [16]	Wisdom et al. [12]	Mixed penalization
300	8.102 (1.96)	7.950 (1.73)	8.262 (1.53)	8.658 (1.59)
450	4.815 (1.42)	5.127 (1.36)	5.771 (1.28)	6.539 (1.56)
600	3.082 (1.20)	3.358 (1.19)	4.140 (1.17)	4.732 (1.43)
750	1.998 (1.11)	2.184 (1.10)	3.013 (1.12)	3.440 (1.31)

Table 3: Mean and (standard deviation) of cepstral distance for each method and reverberation time (best results in boldface).

Rev. time [ms]	Reverberant signal	Kameoka et al. [16]	Wisdom et al. [12]	Mixed penalization
300	3.440 (0.44)	4.057 (0.45)	3.908 (0.48)	3.521 (0.35)
450	4.264 (0.44)	4.636 (0.42)	4.511 (0.41)	3.985 (0.39)
600	4.716 (0.46)	5.006 (0.42)	4.860 (0.40)	4.370 (0.40)
750	5.011 (0.48)	5.264 (0.43)	5.089 (0.40)	4.657 (0.41)

Table 4: Mean and (standard deviation) of SRMR for each method and reverberation time (best results in boldface).

Rev. time [ms]	Reverberant signal	Kameoka et al. [16]	Wisdom et al. [12]	Mixed penalization
300	4.297 (1.78)	2.901 (0.92)	5.269 (2.36)	5.207 (1.78)
450	3.020 (1.15)	2.173 (0.64)	3.907 (1.63)	4.305 (1.44)
600	2.378 (0.86)	1.786 (0.51)	3.175 (1.27)	3.698 (1.21)
750	2.003 (0.71)	1.551 (0.44)	2.727 (1.07)	3.301 (1.05)

Equations

$$x(t)=(h\ast s)(t),$$

$$\mathbf{x}_{k}(t)\doteq\int_{-\infty}^{\infty}x(u)\,w(u-t)\,e^{-2\pi iuk}\,du,\quad t,k\in\mathbb{R},$$

$$\mathbf{x}_{k}[n]\doteq\sum_{m=-\infty}^{\infty}x[m]\,w[m-n]\,e^{-2\pi imk},\quad n,k\in\mathbb{N}.$$

$$\mathbf{x}_{k}[n]\approx\tilde{\mathbf{x}}_{k}[n]\doteq\sum_{\tau=0}^{N_{h}-1}\mathbf{s}_{k}[n-\tau]\,\mathbf{h}_{k}[\tau],$$

$$\begin{aligned}
E|\tilde{\mathbf{x}}_{k}[n]|^{2}&=E\sum_{\tau,\tau^{\prime}}\mathbf{s}_{k}[n-\tau]\,\mathbf{s}_{k}^{*}[n-\tau^{\prime}]\,|\mathbf{h}_{k}[\tau]|e^{j\phi_{k}[\tau]}\,|\mathbf{h}_{k}[\tau^{\prime}]|e^{-j\phi_{k}[\tau^{\prime}]}\\
&=\sum_{\tau,\tau^{\prime}}\mathbf{s}_{k}[n-\tau]\,\mathbf{s}_{k}^{*}[n-\tau^{\prime}]\,|\mathbf{h}_{k}[\tau]|\,|\mathbf{h}_{k}[\tau^{\prime}]|\,E\,e^{j(\phi_{k}[\tau]-\phi_{k}[\tau^{\prime}])}\\
&=\sum_{\tau,\tau^{\prime}}\mathbf{s}_{k}[n-\tau]\,\mathbf{s}_{k}^{*}[n-\tau^{\prime}]\,|\mathbf{h}_{k}[\tau]|\,|\mathbf{h}_{k}[\tau^{\prime}]|\,\delta_{\tau\tau^{\prime}}\\
&=\sum_{\tau}|\mathbf{s}_{k}[n-\tau]|^{2}\,|\mathbf{h}_{k}[\tau]|^{2}.
\end{aligned}$$

$$X_{k}[n]=\sum_{\tau}S_{k}[n-\tau]H_{k}[\tau],$$

$$Y_{k}[n]=X_{k}[n]+\epsilon_{k}[n],$$

$$\pi_{like}(Y|S,H)=\prod_{k=1}^{K}\prod_{n=1}^{N}\frac{1}{2\pi\sigma}\exp\left(-\frac{(Y_{k}[n]-X_{k}[n])^{2}}{\sigma^{2}}\right).$$

$$\pi_{prior}(S)=\prod_{k=1}^{K}\prod_{n=1}^{N}\frac{1}{2\Gamma(1+1/p)\,b_{k}}\exp\left(-\frac{|S_{k}[n]|^{p}}{b_{k}^{p}}\right),$$

$$\pi_{prior}(V)=\prod_{k=1}^{K}\prod_{n=2}^{N_{h}}\frac{1}{2\pi\eta_{k}}\exp\left(-\frac{V_{k}[n]^{2}}{\eta_{k}^{2}}\right).$$

$$\pi_{post}(S,H|Y)\propto\pi_{like}(Y|S,H)\,\pi_{prior}(S)\,\pi_{prior}(H).$$

$$J(S,H)\doteq\sum_{k=1}^{K}\left(\|Y_{k}-X_{k}\|_{2}^{2}+\lambda_{s,k}\|S_{k}\|_{p}^{p}+\lambda_{h,k}\|LH_{k}\|_{2}^{2}\right),$$

$$(i)\;g(w,w)=f(w)\quad\text{and}\quad(ii)\;g(w,w^{\prime})\geq f(w),\quad\forall w,w^{\prime}\in\Omega.$$

$$w^{j}\doteq\operatorname*{arg\,min}_{w}\,g(w,w^{j-1}).$$

$$g_{s}(S,S^{\prime})\doteq\cdots+\sum_{k,n}\lambda_{s,k}\left(\frac{p}{2}S^{\prime}_{k}[n]^{p-2}S_{k}[n]^{2}+|S^{\prime}_{k}[n]|^{p}-\frac{p}{2}|S^{\prime}_{k}[n]|^{p}\right)$$

$$\begin{aligned}
g_{s}(S,S)&=\cdots+\sum_{k}\lambda_{h,k}\|LH^{\prime}_{k}\|_{2}^{2}+\sum_{k,n}\lambda_{s,k}\left(\frac{p}{2}S_{k}[n]^{p-2}S_{k}[n]^{2}+|S_{k}[n]|^{p}-\frac{p}{2}|S_{k}[n]|^{p}\right)\\
&=\cdots+\sum_{k}\lambda_{h,k}\|LH^{\prime}_{k}\|_{2}^{2}+\sum_{k,n}\lambda_{s,k}|S_{k}[n]|^{p}
\end{aligned}$$

$$g_{s}(S,S^{\prime})=\sum_{k}\left(\sum_{n}\left(P_{k,n}+\lambda_{s,k}Q(S^{\prime}_{k}[n])\right)+\lambda_{h,k}\|LH^{\prime}_{k}\|_{2}^{2}\right),$$

$$J(S,H^{\prime})=\sum_{k}\left(\sum_{n}\left(R_{k,n}+\lambda_{s,k}|S_{k}[n]|^{p}\right)+\lambda_{h,k}\|LH^{\prime}_{k}\|_{2}^{2}\right).$$

$$P_{k,n}-R_{k,n}=\cdots-\left(Y_{k}[n]-\sum_{\tau}S_{k}[\tau]H^{\prime}_{k}[n-\tau]\right)^{2}$$

Taxonomy

Topics: Speech and Audio Processing · Blind Source Separation Techniques · Advanced Adaptive Filtering Techniques

Full text

Mixed penalization in convolutive nonnegative matrix factorization for blind speech dereverberation

Francisco J. Ibarrola Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional, sinc(i), FICH-UNL/CONICET, Argentina. Ciudad Universitaria, CC 217, Ruta Nac. 168, km 472.4, (3000) Santa Fe, Argentina.

Leandro E. Di Persia ∗

Ruben D. Spies Instituto de Matemática Aplicada del Litoral, IMAL, CONICET-UNL, Centro Científico Tecnológico CONICET Santa Fe, Colectora Ruta Nac. 168, km 472, Paraje “El Pozo”, (3000), Santa Fe, Argentina and Departamento de Matemática, Facultad de Ingeniería Química, Universidad Nacional del Litoral, Santa Fe, Argentina.

Abstract

When a signal is recorded in an enclosed room, it typically gets affected by reverberation. This degradation represents a problem when dealing with audio signals, particularly in the field of speech signal processing, such as automatic speech recognition. Although there are some approaches to deal with this issue that are quite satisfactory under certain conditions, constructing a method that works well in a general context still poses a significant challenge. In this article, we propose a method based on convolutive nonnegative matrix factorization that mixes two penalizers in order to impose certain characteristics over the time-frequency components of the restored signal and the reverberant components. An algorithm for implementing the method is described and tested. Comparisons of the results against those obtained with state of the art methods are presented, showing significant improvement.

Keywords: signal processing, dereverberation, regularization.

1 Introduction

In recent years, many technological developments have drawn attention to human-machine interaction. Since the most natural and easiest way of human communication is through speech, much research effort has been put into achieving the same natural interaction with machines. This effort has already generated many advances in a wide variety of fields such as automatic speech recognition ([1]), automatic translation systems ([2]) and control of remote devices through voice ([3]), to name only a few. A significant amount of work has recently been devoted to making speech recognition robust ([4]), resulting in several advances in the areas of speech enhancement ([1], [5]), multiple source separation ([6], [7]), and particularly in dereverberation techniques ([8]), which constitute the topic of this work.

When recorded in enclosed rooms, audio signals will almost certainly be affected by reverberant components due to reflections of the sound waves off the walls, ceiling, floor or furniture. This can severely degrade the characteristics of the recorded signal ([9]), making it difficult to process, particularly when required for certain speech applications ([10]). The goal of any dereverberation technique is to remove or to attenuate the reverberant components in order to obtain a cleaner signal. The dereverberation problem is called “blind” when the available data consists only of the reverberant signal itself, and this is the problem we shall deal with in this work.

Depending on the problem, our observation might consist of a single or multi-channel signal. That is, we might have a signal recorded by one or more microphones. For the latter case, quite a few methods exist that work relatively well ([11], [12]).

For the single-channel case, we may distinguish between supervised and unsupervised approaches. The first kind refers to those that begin with a training stage that serves to learn some characteristics of the reverberation conditions, while the second kind refers to those methods that can be applied directly to the reverberant signal. Some supervised methods ([13], [14], [15]) appear to perform somewhat better than unsupervised ones, but they have the disadvantage of needing training data corresponding to the specific room conditions, microphone and source locations, as well as a preliminary training process that might take a significant amount of time.

In the context of unsupervised blind dereverberation, although some recently proposed methods ([12], [16]) work reasonably well, there is still much room for improvement. Our work is based on a convolutive non-negative matrix factorization (NMF) reverberation model, as proposed by Kameoka et al. ([16]), along with a Bayesian approach for building a generalized functional that mixes two types of penalizers over the elements of the representation model. Mixed penalization approaches have recently been applied successfully by several authors in many areas, mainly in signal and image processing applications ([17], [18], [19], [20], [21]). These techniques have been shown to produce good results in terms of enhancing certain desirable characteristics of the solutions while precluding unwanted ones.

1.1 A Reverberation Model

Let $s,x:\mathbb{R}\rightarrow\mathbb{R}$ , with support in $[0,\infty)$ , be the functions associated to the clean and reverberant signals, respectively. As is customary, we shall assume that the reverberation process is well represented by a Linear Time-Invariant (LTI) system. Thus, the reverberation model can be written as

$$x(t)=(h\ast s)(t),\tag{1}$$

where $h:\mathbb{R}\rightarrow\mathbb{R}$ is the room impulse response (RIR) signal, and “ $\ast$ ” denotes convolution. This LTI hypothesis implies we are assuming the source and microphone positions to be static, and the energy of the signal to be low enough for the effect of the non-linear components to be relatively insignificant.
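As a minimal sketch of model (1) in discrete time, the following pure-Python snippet convolves a toy clean signal with a toy impulse response. The helper names `convolve` and `toy_rir`, the exponential decay rate and the signal values are illustrative assumptions, not part of the original method or data.

```python
import math


def convolve(h, s):
    """Full discrete convolution: out[t] = sum_u h[u] * s[t - u]."""
    out = [0.0] * (len(h) + len(s) - 1)
    for i, hi in enumerate(h):
        for j, sj in enumerate(s):
            out[i + j] += hi * sj
    return out


def toy_rir(length, decay):
    """Hypothetical impulse response: direct path 1.0, then an exponential tail."""
    return [math.exp(-decay * t) for t in range(length)]


s = [1.0, 0.0, 0.0, 0.5, 0.0]     # toy "clean" signal (assumed values)
h = toy_rir(length=4, decay=1.0)  # toy room impulse response (assumed)
x = convolve(h, s)                # reverberant signal, x = h * s
```

Each impulse in `s` is smeared over the following samples by the tail of `h`, which is the time-domain effect that the spectrogram model of the next paragraphs approximates subband by subband.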

When dealing with sound signals (particularly speech signals), it is often convenient to work with the associated spectrograms rather than the signals themselves. Thus, we make use of the short time Fourier transform (STFT), defined as

$$\mathbf{x}_{k}(t)\doteq\int_{-\infty}^{\infty}x(u)\,w(u-t)\,e^{-2\pi iuk}\,du,\quad t,k\in\mathbb{R},$$

where $w:\mathbb{R}\rightarrow\mathbb{R}^{+}_{0}$ is a compactly supported, even function such that $\|w\|_{1}=1$ . This function is called the window.

In practice, we work with discretized versions of the signals involved ( $x[\cdot],h[\cdot],s[\cdot],$ and $w[\cdot]$ ). With this in mind, we shall define the discrete STFT as

$$\mathbf{x}_{k}[n]\doteq\sum_{m=-\infty}^{\infty}x[m]\,w[m-n]\,e^{-2\pi imk},\quad n,k\in\mathbb{N}.$$

Denoting the STFTs of $s$ and $h$ by $\mathbf{s}_{k}[n]$ and $\mathbf{h}_{k}[n]$ , respectively, a discretized approximation of the STFT model associated to (1) is given by

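$$\mathbf{x}_{k}[n]\approx\tilde{\mathbf{x}}_{k}[n]\doteq\sum_{\tau=0}^{N_{h}-1}\mathbf{s}_{k}[n-\tau]\,\mathbf{h}_{k}[\tau],$$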

where $n=1,\ldots,N,$ is a discretized time variable that corresponds to window location, $k=1,\ldots,K,$ denotes the frequency subband and $N_{h}$ is a parameter of the model associated to the expected maximum duration of the reverberation phenomenon. The model is built as in [22], the approximation being due to the use of band-to-band filters only. Later on, the values of $n$ will be chosen in such a way that the union of the windows’ supports contains the support of the observed signal, and the values of $k$ in such a way that they cover the whole frequency spectrum, up to half the sampling frequency.

Now, let us write $\mathbf{h}_{k}[\tau]=|\mathbf{h}_{k}[\tau]|e^{j\phi_{k}[\tau]}$ . It is well known ([23]) that the phase angles $\phi_{k}[\tau]$ are highly sensitive to mild variations in the reverberation conditions. To overcome the problems derived from this, we shall proceed (see [16]) by treating the $K\times N_{h}$ variables $\phi_{k}[\tau]$ as i.i.d. random variables with uniform distribution in $[-\pi,\pi)$ . Denoting the complex conjugate by “∗” and the Kronecker delta by $\delta_{ij}$ , the expected value of $|\tilde{\mathbf{x}}_{k}[n]|^{2}$ is given by

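$$\begin{aligned}
E|\tilde{\mathbf{x}}_{k}[n]|^{2}&=E\sum_{\tau,\tau^{\prime}}\mathbf{s}_{k}[n-\tau]\,\mathbf{s}_{k}^{*}[n-\tau^{\prime}]\,|\mathbf{h}_{k}[\tau]|e^{j\phi_{k}[\tau]}\,|\mathbf{h}_{k}[\tau^{\prime}]|e^{-j\phi_{k}[\tau^{\prime}]}\\
&=\sum_{\tau,\tau^{\prime}}\mathbf{s}_{k}[n-\tau]\,\mathbf{s}_{k}^{*}[n-\tau^{\prime}]\,|\mathbf{h}_{k}[\tau]|\,|\mathbf{h}_{k}[\tau^{\prime}]|\,E\,e^{j(\phi_{k}[\tau]-\phi_{k}[\tau^{\prime}])}\\
&=\sum_{\tau,\tau^{\prime}}\mathbf{s}_{k}[n-\tau]\,\mathbf{s}_{k}^{*}[n-\tau^{\prime}]\,|\mathbf{h}_{k}[\tau]|\,|\mathbf{h}_{k}[\tau^{\prime}]|\,\delta_{\tau\tau^{\prime}}\\
&=\sum_{\tau}|\mathbf{s}_{k}[n-\tau]|^{2}\,|\mathbf{h}_{k}[\tau]|^{2}.
\end{aligned}$$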
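The cancellation of the cross terms can be checked numerically. The sketch below is an illustration under stated assumptions: instead of sampling random phases, it averages over a uniform grid of four phase values per tap, for which the mean of $e^{j\phi}$ is exactly zero, so the grid average reproduces the expectation for this particular quantity. The function name `expected_power` and the tap values are hypothetical.

```python
import cmath
import itertools
import math


def expected_power(s_taps, h_mag, n_phase=4):
    """Average |sum_tau s[tau] * |h[tau]| * exp(j * phi[tau])|^2 over a phase grid.

    Each phase runs independently over n_phase equally spaced values in
    [-pi, pi), which stands in for the i.i.d. uniform phases of the text.
    """
    grid = [2 * math.pi * q / n_phase - math.pi for q in range(n_phase)]
    total, count = 0.0, 0
    for phases in itertools.product(grid, repeat=len(h_mag)):
        x = sum(s * a * cmath.exp(1j * ph)
                for s, a, ph in zip(s_taps, h_mag, phases))
        total += abs(x) ** 2
        count += 1
    return total / count


s_taps = [1 + 2j, -0.5j, 0.3]  # s_k[n - tau] for tau = 0, 1, 2 (assumed values)
h_mag = [1.0, 0.6, 0.2]        # |h_k[tau]| (assumed values)

# Right-hand side of the last line of the derivation.
closed_form = sum(abs(s) ** 2 * a ** 2 for s, a in zip(s_taps, h_mag))
averaged = expected_power(s_taps, h_mag)
```

Under these assumptions `averaged` and `closed_form` agree up to floating-point rounding, regardless of the phases of the entries of `s_taps`.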

Note that the $[-\pi,\pi)$ interval choice for $\phi_{k}[\tau]$ is arbitrary, since this result holds for any $2\pi-$ length interval. Finally, let us define $S_{k}[n]\doteq|\mathbf{s}_{k}[n]|^{2}$ , $H_{k}[n]\doteq|\mathbf{h}_{k}[n]|^{2}$ and $X_{k}[n]\doteq E|\tilde{\mathbf{x}}_{k}[n]|^{2}$ . Then, our model reads

$$X_{k}[n]=\sum_{\tau}S_{k}[n-\tau]H_{k}[\tau],\tag{3}$$

and the square magnitude of the observed spectrogram components can be written as

$$Y_{k}[n]=X_{k}[n]+\epsilon_{k}[n],\tag{4}$$

where $\epsilon_{k}[n]$ denotes the representation error. As shown in [16], this model is equivalent to a convolutive NMF ([24]) with diagonal basis. In the next section, we derive a cost function in order to find an appropriate convolutive representation that allows us to isolate the components $S_{k}[n]$ .
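The convolutive model for the power spectrograms can be sketched as a row-wise convolution, one frequency subband at a time. The code below is illustrative only: it uses zero-based indices, represents matrices as lists of rows, and assumes $S_{k}[m]=0$ for $m<0$ (consistent with signals supported in $[0,\infty)$). The name `convolutive_model` and the sample values are not from the original.

```python
def convolutive_model(S, H):
    """Return X with X[k][n] = sum_tau S[k][n - tau] * H[k][tau].

    S is K x N and H is K x N_h, both lists of rows of nonnegative floats.
    Terms with n - tau < 0 are dropped (zero padding, an assumption).
    """
    K, N = len(S), len(S[0])
    X = [[0.0] * N for _ in range(K)]
    for k in range(K):
        for n in range(N):
            for tau, h in enumerate(H[k]):
                if n - tau >= 0:
                    X[k][n] += S[k][n - tau] * h
    return X


S = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 3.0, 0.0, 0.0]]  # toy clean power spectrogram, K = 2, N = 4
H = [[1.0, 0.5],
     [1.0, 0.1]]            # toy subband responses, N_h = 2
X = convolutive_model(S, H)
```

Each row of `H` acts only on the matching row of `S`, which is what the "diagonal basis" remark refers to: there is no mixing across frequency subbands.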

2 A Bayesian approach

In the following, we will use a Bayesian approach to derive a cost function which we will then minimize in order to obtain our regularized solution. Let us begin by assuming, for every $k$ , $\epsilon_{k}[n],S_{k}[n],H_{k}[n]$ are independent random variables, also independent with respect to $k$ . Also, let us denote by $S,Y,X\in\mathbb{R}^{K\times N}$ and $H\in\mathbb{R}^{K\times N_{h}}$ the non-negative matrices whose $(k,n)$ -th elements are $S_{k}[n],Y_{k}[n],X_{k}[n]$ and $H_{k}[n]$ , respectively.

As is customary ([16]), for the representation error, we assume $\epsilon_{k}[n]\sim\mathcal{N}(0,\sigma^{2})$ , where $\sigma>0$ is an unknown parameter, and the variables are uncorrelated with respect to $n$ . Hence, it follows from (4) that the conditional distribution of $Y$ given $S$ and $H$ (i.e. the likelihood distribution) is given by

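$$\pi_{like}(Y|S,H)=\prod_{k=1}^{K}\prod_{n=1}^{N}\frac{1}{2\pi\sigma}\exp\left(-\frac{(Y_{k}[n]-X_{k}[n])^{2}}{\sigma^{2}}\right).$$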

Let us now turn our attention to $S$ . Figure 1 depicts the $\log$ -spectrograms for a clean signal and its reverberant version. As can be observed, while the spectrogram of the clean signal is somewhat sparse, the one corresponding to the reverberant signal presents a smoother or more diffuse structure. The presence of discontinuities in the spectrogram of the clean signal can be favored by assuming $S$ follows a generalized Gaussian distribution ([25]). Namely,

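$$\pi_{prior}(S)=\prod_{k=1}^{K}\prod_{n=1}^{N}\frac{1}{2\Gamma(1+1/p)\,b_{k}}\exp\left(-\frac{|S_{k}[n]|^{p}}{b_{k}^{p}}\right),$$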

where $p\in(0,2)$ is a prescribed parameter and $b_{k}>0$ is unknown.

With regard to $H$ , although no general conditions are expected on its individual components, we do expect its first order time differences to exhibit a certain degree of regularity (see Figures 2 and 3). In fact, if windows are set close enough relative to the duration of the reverberation phenomenon, then consecutive time components of $H$ will capture overlapped information, which along with the exponential decay characteristic of the RIR ([26]) accounts for a somewhat smooth structure. Therefore, we define the time differences matrix $V\in\mathbb{R}^{K\times(N_{h}-1)}$ , with components $V_{k}[n]\doteq H_{k}[n]-H_{k}\left[n-1\right]\;\forall n=1,\ldots,N_{h}-1,\;k=1,\ldots,K$ . The regularity of these variations is accounted for by assuming $V$ follows a normal distribution:

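$$\pi_{prior}(V)=\prod_{k=1}^{K}\prod_{n=2}^{N_{h}}\frac{1}{2\pi\eta_{k}}\exp\left(-\frac{V_{k}[n]^{2}}{\eta_{k}^{2}}\right).$$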

Using Bayes’ theorem, the a posteriori joint distribution of $S$ and $H$ conditioned to $Y$ satisfies

$$\pi_{post}(S,H|Y)\propto\pi_{like}(Y|S,H)\,\pi_{prior}(S)\,\pi_{prior}(H).\tag{5}$$

2.1 Mixed penalization

Our goal is to find $\hat{S}$ and $\hat{H}$ that are representative of the a posteriori distribution (5). Although the immediate instinct might be to compute the expected value, there are quite a few other ways to proceed, with different degrees of reliability and complexity. In light of the assumed distributions and the high dimensionality of the problem, the maximum a posteriori (MAP) estimator is a reasonable choice in this case. Note that maximizing (5) is tantamount to minimizing $-\log\pi_{post}(S,H|Y)$ . If we denote by $S_{k},Y_{k},X_{k}\in\mathbb{R}^{N}$ , $H_{k}\in\mathbb{R}^{N_{h}}$ and $V_{k}\in\mathbb{R}^{N_{h}-1}$ the (transposed) rows of $S,Y,X,H$ and $V$ , and define $L\in\mathbb{R}^{(N_{h}-1)\times N_{h}}$ in such a way that $LH_{k}=V_{k}$ , then

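$$-\log\pi_{post}(S,H|Y)=\sum_{k=1}^{K}\left(\frac{1}{\sigma^{2}}\|Y_{k}-X_{k}\|_{2}^{2}+\frac{1}{b_{k}^{p}}\|S_{k}\|_{p}^{p}+\frac{1}{\eta_{k}^{2}}\|LH_{k}\|_{2}^{2}\right)+C,$$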

where $C$ is a constant which does not depend on $S$ nor $H$ .

Finally, the latter equation leads to the cost function

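$$J(S,H)\doteq\sum_{k=1}^{K}\left(\|Y_{k}-X_{k}\|_{2}^{2}+\lambda_{s,k}\|S_{k}\|_{p}^{p}+\lambda_{h,k}\|LH_{k}\|_{2}^{2}\right),\tag{6}$$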

which shall be minimized to find our regularized solution. In this context, $\lambda_{s,k},\lambda_{h,k}\geq 0$ can be thought of as penalization parameters weighting both penalizers relative to the fidelity term, whereas the exponent $p\in(0,2)$ is a tuning parameter. It is worth pointing out that small values of $p$ will promote sparsity, whereas values close to $2$ will promote smoothness. Since there is a clear scale indeterminacy in the representation (3), we impose the (somewhat arbitrary) additional constraint $||S_{k}||_{\infty}=||Y_{k}||_{\infty}\;\forall k$ , which means that the maximum values shall remain equal for every frequency.
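To make the three terms of the cost function concrete, the following sketch evaluates $J(S,H)$ for small list-of-rows matrices. It is only an illustration: the function name `cost`, the zero-padding convention for the convolution and the sample values are assumptions, and $L$ is applied implicitly as the first-order time difference of each row of $H$.

```python
def cost(Y, S, H, lam_s, lam_h, p):
    """Evaluate J(S, H): sum over subbands k of
    ||Y_k - X_k||_2^2 + lam_s[k] * ||S_k||_p^p + lam_h[k] * ||L H_k||_2^2.
    """
    total = 0.0
    for k in range(len(Y)):
        fidelity = 0.0
        for n in range(len(Y[k])):
            # X_k[n] = sum_tau S_k[n - tau] H_k[tau], zero-padded for n - tau < 0.
            x = sum(S[k][n - tau] * h
                    for tau, h in enumerate(H[k]) if n - tau >= 0)
            fidelity += (Y[k][n] - x) ** 2
        sparsity = sum(abs(v) ** p for v in S[k])
        # L H_k is the vector of first-order time differences of H_k.
        smoothness = sum((H[k][t] - H[k][t - 1]) ** 2
                         for t in range(1, len(H[k])))
        total += fidelity + lam_s[k] * sparsity + lam_h[k] * smoothness
    return total


# Toy example with one subband in which the model reproduces Y exactly,
# so only the two penalizers contribute to the cost.
J_value = cost(Y=[[1.0, 0.5, 2.0, 1.0]],
               S=[[1.0, 0.0, 2.0, 0.0]],
               H=[[1.0, 0.5]],
               lam_s=[0.1], lam_h=[2.0], p=1.0)
```

In this toy case the fidelity term vanishes, the sparsity term contributes $0.1\times 3$ and the smoothness term contributes $2\times 0.25$.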

2.2 Regularization parameters

As mentioned before, the parameters $\lambda_{h,k},\lambda_{s,k},\;k=1,\ldots,K,$ weight the penalizers against the fidelity term. In this sense, the optimal weights of these regularization parameters might vary as a function of the frequency subband, and hence their proposed dependency on $k$ . Since searching blindly for $2K$ parameters is non-viable in practice, we quantify this dependency by defining $\lambda_{h,k}\doteq\lambda_{h}\sum_{n=1}^{N}|Y_{k}[n]|^{2}$ and $\lambda_{s,k}\doteq\lambda_{s}\;\forall k=1,\ldots,K$ (note that the relation between $S_{k}$ and $Y_{k}$ is already contemplated in the constraint that intends to avoid scale indeterminacy). This means we only need to look for two parameters ( $\lambda_{h},\lambda_{s}$ ) and then multiply $\lambda_{h}$ by the energy of the signal associated to each row of $Y$ .
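The reduction from $2K$ parameters to two can be written in a few lines. In this sketch the function name `subband_weights` and the example matrix are hypothetical; the formulas are the ones stated above.

```python
def subband_weights(Y, lam_h, lam_s):
    """Per-subband weights: lam_h scaled by the energy of each row of Y,
    and a constant lam_s for every subband."""
    lam_h_k = [lam_h * sum(v ** 2 for v in row) for row in Y]
    lam_s_k = [lam_s for _ in Y]
    return lam_h_k, lam_s_k


Y_toy = [[1.0, 2.0], [0.0, 3.0]]  # assumed toy observation, K = 2
lam_h_k, lam_s_k = subband_weights(Y_toy, lam_h=1.0, lam_s=1e-4)
```

Subbands carrying more energy therefore receive a proportionally stronger smoothness penalty on the corresponding row of $H$.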

Next, we present an algorithm for approximating matrices $H$ and $S$ minimizing $J$ .

3 Updating rules

We shall build an iterative algorithm following the idea in [16], which is based on the auxiliary function technique.

Let $\Omega\subset\mathbb{R}$ and $f:\Omega\rightarrow\mathbb{R}_{0}^{+}$ . Then, $g:\Omega\times\Omega\rightarrow\mathbb{R}_{0}^{+}$ is called an auxiliary function for $f$ if

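$$(i)\;g(w,w)=f(w)\quad\text{and}\quad(ii)\;g(w,w^{\prime})\geq f(w),\quad\forall w,w^{\prime}\in\Omega.\tag{7}$$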

Let $w^{0}\in\Omega$ be arbitrary, and let

$$w^{j}\doteq\operatorname*{arg\,min}_{w}\,g(w,w^{j-1}).\tag{8}$$

With this definition, it can be shown ([27]) that the sequence $\{f(w^{j})\}_{j}$ is non-increasing. We intend to use this property as a tool for alternately updating the matrices $H$ and $S$ . Let us begin by fixing $H=H^{\prime}$ , where $H^{\prime}$ is an arbitrary $K\times N_{h}$ matrix. We will show that

[TABLE]

is an auxiliary function for $J$ (as defined in (6)) with respect to $S$ . From this point on, we denote by $X^{\prime}_{k}[n]=\sum_{\tau}S^{\prime}_{k}[n-\tau]H^{\prime}_{k}[\tau]$ . The equality condition $(i)$ in (7) is rather straightforward. In fact,

[TABLE]

To prove condition $(ii)$ in (7) we begin by defining

[TABLE]

and $Q:\mathbb{R}^{+}\rightarrow\mathbb{R}$ such that $Q(x)\doteq\frac{p}{2}x^{p-2}S_{k}[n]^{2}+x^{p}-\frac{p}{2}x^{p}$ . With these definitions, we can write

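$$g_{s}(S,S^{\prime})=\sum_{k}\left(\sum_{n}\left(P_{k,n}+\lambda_{s,k}Q(S^{\prime}_{k}[n])\right)+\lambda_{h,k}\|LH^{\prime}_{k}\|_{2}^{2}\right),$$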

and

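$$J(S,H^{\prime})=\sum_{k}\left(\sum_{n}\left(R_{k,n}+\lambda_{s,k}|S_{k}[n]|^{p}\right)+\lambda_{h,k}\|LH^{\prime}_{k}\|_{2}^{2}\right).$$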

Hence, to prove that $g_{s}(S,S^{\prime})\geq J(S,H^{\prime})\;\forall S,S^{\prime}$ it is sufficient to show that $P_{k,n}\geq R_{k,n}$ and $Q(S^{\prime}_{k}[n])\geq|S_{k}[n]|^{p}\;\forall n=1,\ldots,N,k=1,\ldots,K$ . In fact,

[TABLE]

To prove that $Q(S^{\prime}_{k}[n])\geq|S_{k}[n]|^{p}$ , we begin by noting that $Q\in\mathcal{C}^{\infty}(\mathbb{R}^{+})$ . Then, the first order necessary condition for $Q$ yields

[TABLE]

meaning the only point at which the derivative of $Q$ equals zero is at $x=S_{k}[n]$ . Furthermore, $\frac{\partial^{2}}{\partial x^{2}}Q(S_{k}[n])=S_{k}[n]^{p-2}(2p-p^{2})>0\;\forall p\in(0,2)$ , meaning that $Q(S_{k}[n])=|S_{k}[n]|^{p}$ is the global minimum of $Q$ . This yields

[TABLE]

$\blacksquare$
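The role of the bound $Q$ can be illustrated on a scalar toy problem. The sketch below is not the paper's algorithm: it minimizes the hypothetical one-dimensional function $f(w)=(a-w)^{2}+\lambda|w|^{p}$ by replacing $|w|^{p}$ with the quadratic bound $Q$ evaluated at the previous iterate and minimizing the resulting auxiliary function in closed form. The names `f`, `Q`, `mm_step` and the values of `a`, `lam`, `p` are assumptions.

```python
def f(w, a, lam, p):
    """Toy scalar objective: quadratic fidelity plus an l_p-type penalty."""
    return (a - w) ** 2 + lam * abs(w) ** p


def Q(x, s, p):
    """Upper bound of |s|^p that is quadratic in s and tight at x = s (x > 0):
    Q(x) = (p/2) x^(p-2) s^2 + x^p - (p/2) x^p."""
    return 0.5 * p * x ** (p - 2) * s ** 2 + (1 - 0.5 * p) * x ** p


def mm_step(w_prev, a, lam, p):
    """Minimizer over w of the auxiliary function (a - w)^2 + lam * Q(w_prev, w, p)."""
    return a / (1.0 + lam * 0.5 * p * w_prev ** (p - 2))


a, lam, p = 1.0, 0.5, 1.0
w = 2.0
values = [f(w, a, lam, p)]
for _ in range(10):
    w = mm_step(w, a, lam, p)
    values.append(f(w, a, lam, p))
```

With these values the recorded objective `values` never increases and `w` approaches $0.75$, the minimizer of $(1-w)^{2}+0.5\,w$ over $w>0$, which illustrates the non-increasing property of the sequence $\{f(w^{j})\}_{j}$.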

In an analogous way, it can be shown that if we let $S=S^{\prime}$ be fixed, where $S^{\prime}$ is an arbitrary $K\times N$ matrix, then

[TABLE]

is an auxiliary function for $J(S^{\prime},H)$ with respect to $H$ .

Having defined auxiliary functions, we will use the updating rule derived from (8) to build an algorithm for iteratively approaching matrices $S$ and $H$ minimizing $J$ . Notice this requires minimizing $g_{s}$ and $g_{h}$ with respect to the updating variables, but since $g_{s}$ is quadratic with respect to $S$ and $g_{h}$ is quadratic with respect to $H$ , we can simply use the first order necessary conditions in both cases. From this point on, in the context of the iterative updating process, $S^{\prime}$ and $H^{\prime}$ will refer not to arbitrary nonnegative matrices, but to those estimations of $S$ and $H$ obtained in the immediately previous step.

3.1 Updating rule for S

Firstly, we shall derive an updating rule for $S_{k}[\tau]$ . That is, we wish to minimize $g_{s}$ w.r.t. $S$ . The first order necessary condition on $g_{s}$ yields

[TABLE]

which easily leads to the multiplicative updating rule

[TABLE]

In order to avoid the aforementioned scale indeterminacy, every updating step is to be followed by scaling $S_{k}$ so that its $\ell^{\infty}$ norm coincides with that of the observation $Y_{k}$ .
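The rescaling step that follows each update of $S$ can be sketched as below. The function name `rescale_rows` is hypothetical, and leaving an all-zero row untouched is an assumption added to avoid a division by zero.

```python
def rescale_rows(S, Y):
    """Scale each row S_k so that max_n |S_k[n]| equals max_n |Y_k[n]|."""
    scaled = []
    for s_row, y_row in zip(S, Y):
        s_max = max(abs(v) for v in s_row)
        y_max = max(abs(v) for v in y_row)
        # Assumption: an all-zero row of S is left as is.
        factor = y_max / s_max if s_max > 0 else 1.0
        scaled.append([factor * v for v in s_row])
    return scaled


S_scaled = rescale_rows([[1.0, 2.0, 4.0], [0.0, 0.0, 0.0]],
                        [[3.0, 8.0, 1.0], [1.0, 2.0, 3.0]])
```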

3.2 Updating rule for H

In order to find an updating rule for $H$ , we shall write $g_{h}$ as a function of the transposed rows $H_{k}$ . We begin by noting

[TABLE]

Next, we define the diagonal matrices $A^{k},B^{k}\in\mathbb{R}^{N_{h}\times N_{h}}$ , whose diagonal elements are $A^{k}_{\tau,\tau}\doteq\sum_{n}S^{\prime}_{k}[n-\tau]X^{\prime}_{k}[n]$ and $B^{k}_{\tau,\tau}\doteq H^{\prime}_{k}[\tau]$ , and the vector $\zeta^{k}\in\mathbb{R}^{N_{h}}$ with components $\zeta^{k}_{\tau}=\sum_{n}S^{\prime}_{k}[n-\tau]Y_{k}[n]$ . With these definitions, we can write

[TABLE]

Now, the first order necessary condition for $g_{h}$ with respect to $H_{k}$ is given by

[TABLE]

which readily leads to an updating rule consisting of solving the linear system

[TABLE]

Let us notice that under the assumption that the diagonal elements of $A^{k}$ and $B^{k}$ are strictly positive, and since $L^{\text{\scriptsize{T}}}L$ is positive-semidefinite, $(B^{k})^{-1}A^{k}+\lambda_{h,k}L^{\text{\scriptsize{T}}}L$ is positive-definite, and hence the linear system has a unique solution. The assumption of $A^{k}_{\tau,\tau}>0$ is adequate, since these elements correspond to the discrete convolution of $S^{\prime}_{k}$ and $X^{\prime}_{k}$ . Although the validity of the hypothesis over $B^{k}_{\tau,\tau}$ is not so clear, in practice, the matrix in system (11) has turned out to be non-singular. Nonetheless, $H_{k}$ can be computed as the best approximate solution in the least-squares sense. Then, solving this $N_{h}\times N_{h}$ linear system entails no challenge, since $N_{h}$ is usually chosen relatively small, depending on the window step and the reverberation time.
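A small sketch of this update for a single subband is given below. It assumes that the system has the form $\left((B^{k})^{-1}A^{k}+\lambda_{h,k}L^{\text{T}}L\right)H_{k}=\zeta^{k}$, which is what the matrix named above and the definition of $\zeta^{k}$ suggest; the function names `solve` and `update_h`, the plain Gaussian elimination and the numerical values are illustrative choices rather than the authors' implementation.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (A[r][n] - tail) / A[r][r]
    return x


def update_h(a_diag, b_diag, zeta, lam):
    """Build M = B^{-1} A + lam * L^T L (assumed form) and solve M h = zeta.

    a_diag and b_diag hold the diagonals of A^k and B^k; L is the
    first-difference matrix, so L^T L is tridiagonal.
    """
    n = len(zeta)
    M = [[0.0] * n for _ in range(n)]
    for t in range(n):
        M[t][t] = a_diag[t] / b_diag[t]
    for t in range(n - 1):
        # Contribution of the difference row e_{t+1} - e_t to L^T L.
        M[t][t] += lam
        M[t + 1][t + 1] += lam
        M[t][t + 1] -= lam
        M[t + 1][t] -= lam
    return M, solve(M, zeta)


M, h_new = update_h(a_diag=[2.0, 4.0, 6.0], b_diag=[1.0, 2.0, 3.0],
                    zeta=[2.0, 2.0, 2.0], lam=1.0)
```

In this toy case the solution is the constant vector of ones for any value of `lam`, since constant vectors are not penalized by the first-difference operator.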

All the steps for the dereverberation process are stated in Algorithm 1. Note that in the initialization we define the clean spectrogram $S$ equal to the observation, which is natural since in a way they both correspond to the same signal, and $H_{k}$ as a vector with exponential time decay, which is an expected characteristic of an RIR. Finally, we set the stopping criterion over the decay of the norm of two consecutive approximations of $S$ . This has proven to work quite well, although other stopping criteria might be considered.
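The initialization and stopping test described above might look as follows. This is a sketch under explicit assumptions: the decay rate `decay` is a made-up value, the stopping test is read as "the Frobenius norm of the difference between consecutive estimates of $S$ falls below $\delta$", and $\delta=\|Y\|_{F}\times 10^{-3}$ is taken from Table 1. The function names are hypothetical.

```python
import math


def initialize(Y, n_h, decay=0.5):
    """S starts as a copy of the observation; each H_k decays exponentially.
    The decay rate is an illustrative assumption."""
    S0 = [row[:] for row in Y]
    H0 = [[math.exp(-decay * t) for t in range(n_h)] for _ in Y]
    return S0, H0


def frobenius(A):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in A for v in row))


def has_converged(S_new, S_old, delta):
    """Assumed stopping test: ||S_new - S_old||_F < delta."""
    diff = [[a - b for a, b in zip(r_new, r_old)]
            for r_new, r_old in zip(S_new, S_old)]
    return frobenius(diff) < delta


Y_obs = [[3.0, 4.0]]                # toy observation (assumed values)
S0, H0 = initialize(Y_obs, n_h=3)
delta = frobenius(Y_obs) * 1e-3     # threshold as in Table 1
```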

Results to illustrate the performance of the algorithm are presented in the next section.

4 Experimental results

For the experiments, we took $110$ speech signals from the TIMIT database ([28]), recorded at 16 kHz, and artificially made them reverberant by convolution with impulse responses generated with the software Room Impulse Response Generator (https://github.com/ehabets/RIR-Generator), based on the model in [29]. Each signal was degraded under different reverberation conditions: three different room sizes, each with three different microphone positions and four different reverberation times. A web demo for our algorithm can be found at http://fich.unl.edu.ar/sinc/web-demo/blindder/

In order to avoid preprocessing, the choice of the regularization parameters was made a priori by means of empirical rules, based upon signals from a different database. This is supported by the fact that the parameters were observed to be rather robust with respect to variations of the reverberation conditions, and hence they were chosen simply as $\lambda_{h}=1$ and $\lambda_{s}=10^{-4}$ . The rest of the model parameters were chosen as specified in Table 1.

Let us point out that the choice of $N_{h}$ was made so as to allow $H$ to capture early reverberation while precluding overlapped representations. In the first place, it is desirable for $H$ to represent the RIR along the full Early Decay Time (EDT), the time period in which the reverberation phenomenon alters the clean signal the most, so its effect can be effectively nullified. On the other hand, if we were to choose $N_{h}$ too large, certain similarities in the observation $Y$ within a fixed frequency range might end up being represented as echoes from high energy components of $S$ . It is worth mentioning, however, that the performance of our dereverberation method has shown no high sensitivity with respect to the choice of $N_{h}$ .

In order to evaluate the performance of our model, we made comparisons against two state-of-the-art methods that work under the same conditions: the one proposed by Kameoka et al. in [16], with all the parameters chosen as suggested there, and the one proposed by Wisdom et al. in [12], with a window length of $2048$ .

To measure performance, following [30], we made use of the frequency weighted segmental signal-to-noise ratio (fwsSNR) and cepstral distance. Furthermore, we also measured the speech-to-reverberation modulation energy ratio (SRMR, [31]), which has the advantage of being non-intrusive (it does not use the clean signal as an input). The results for each performance measure are stated in Tables 2-4 and depicted in Figures 4-6, classified according to the reverberation times: $300[\text{ms}]$ , $450[\text{ms}]$ , $600[\text{ms}]$ and $750[\text{ms}]$ . Notice that for fwsSNR and SRMR, higher values correspond to better performance, while for the cepstral distance, smaller values indicate higher quality.

In regard to the fwsSNR performance measure, the values in Table 2 (Figure 4) show significant improvements of our proposed method with respect to the other two. This improvement becomes more evident as the reverberation time increases. As for the cepstral distance, although the results in Table 3 (Figure 5) show a better performance of our proposed method than of the other two, the quality with respect to the reverberant signal is improved only for reverberation times of 450[ms] or more. Finally, the SRMR also shows an improvement with respect to the other methods for reverberation times of 450[ms] or greater (see Table 4, Figure 6).

5 Conclusions

In this work, a new blind dereverberation method for speech signals based on regularization over a convolutive NMF representation of the signal spectrograms was introduced and tested. Results show a significant improvement over state-of-the-art methods, especially for high reverberation times. There is certainly much room for improvement, e.g. finding ways of optimally choosing the regularization parameters, exploring the use of other penalizers, etc.

Acknowledgements

This work was supported in part by Consejo Nacional de Investigaciones Científicas y Técnicas, CONICET through PIP 2014-2016 N ${}^{\text{o}}$ 11220130100216-CO, the Air Force Office of Scientific Research, AFOSR/SOARD, through Grant FA9550-14-1-0130, by Universidad Nacional del Litoral, UNL, through CAID-UNL 2011 N ${}^{\text{o}}$ 50120110100519 “Procesamiento de Señales Biomédicas.” and CAI+D-UNL 2016, PIC 50420150100036LI “Problemas Inversos y Aplicaciones a Procesamiento de Señales e Imágenes”.

Bibliography

[1] M. Kim, H.-M. Park, Efficient online target speech extraction using DOA-constrained independent component analysis of stereo data for robust speech recognition, Signal Processing 117 (2015) 126–137.
[2] S. Yun, Y. J. Lee, S. H. Kim, Multilingual speech-to-speech translation system for mobile consumer devices, IEEE Transactions on Consumer Electronics 60 (3) (2014) 508–516.
[3] R. Neßelrath, M. M. Moniri, M. Feld, Combining speech, gaze, and micro-gestures for the multimodal control of in-car functions, in: Intelligent Environments (IE), 2016 12th International Conference on, IEEE, 2016, pp. 190–193.
[4] L. Di Persia, D. Milone, H. L. Rufiner, M. Yanagida, Perceptual evaluation of blind source separation for robust speech recognition, Signal Processing 88 (10) (2008) 2578–2583.
[5] C. E. Martínez, J. Goddard, L. E. Di Persia, D. H. Milone, H. L. Rufiner, Denoising sound signals in a bioinspired non-negative spectro-temporal domain, Digital Signal Processing 38 (2015) 22–31.
[6] L. Di Persia, D. Milone, M. Yanagida, Indeterminacy free frequency-domain blind separation of reverberant audio sources, IEEE Trans. Audio, Speech and Lang. Proc. 17 (2) (2009) 299–311.
[7] L. E. Di Persia, D. H. Milone, Using multiple frequency bins for stabilization of FD-ICA algorithms, Signal Processing 119 (2016) 162–168.
[8] A. Tsilfidis, J. Mourjopoulos, Signal-dependent constraints for perceptually motivated suppression of late reverberation, Signal Processing 90 (3) (2010) 959–965.