Blind Audio Source Separation with Minimum-Volume Beta-Divergence NMF

Valentin Leplat; Nicolas Gillis; Man Shun Ang

arXiv:1907.02404·eess.SP·July 15, 2020


TL;DR

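This paper introduces an NMF model with a volume penalty for single-channel blind audio source separation. The model is provably identifiable under mild assumptions in noiseless conditions and, in experiments, gives more interpretable results and sets superfluous sources to zero when their number is overestimated.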

Contribution

The paper proposes a new NMF model with a volume penalty term, proves that it identifies the sources under mild assumptions in the noiseless case, and shows empirically that it performs model order selection automatically.

Findings

01

More interpretable separation results compared to standard NMF.

02

Effective source recovery even when the number of sources is overestimated.

03

Automatic zeroing of sources in overestimated scenarios.

Abstract

Considering a mixed signal composed of various audio sources and recorded with a single microphone, we consider in this paper the blind audio source separation problem, which consists in isolating and extracting each of the sources. To perform this task, nonnegative matrix factorization (NMF) based on the Kullback-Leibler and Itakura-Saito $\beta$-divergences is a standard and state-of-the-art technique that uses the time-frequency representation of the signal. We present a new NMF model better suited for this task. It is based on the minimization of $\beta$-divergences along with a penalty term that promotes the columns of the dictionary matrix to have a small volume. Under some mild assumptions and in noiseless conditions, we prove that this model is able to identify the sources. In order to solve this problem, we propose multiplicative updates whose derivations are based on…

Tables

Table 1. TABLE I: Differentiable convex-concave-constant decomposition of the $\beta$-divergence under the form (5) [5].

	$\check{d}(x|y)$	$\hat{d}(x|y)$	$\bar{d}(x)$
$β = 0$	$x y^{- 1}$	$\log (y)$	$x (\log (x) - 1)$
$β \in [1, 2]$	$d_{β} (x \| y)$	0	0

Table 2. TABLE II: Multiplicative update for min-vol KL-NMF.

$W=\tilde{W}\odot\frac{\left[\left[J_{F,N}H^{T}-4\lambda(\tilde{W}Y^{-})\right]^{.2}+8\lambda\tilde{W}(Y^{+}+Y^{-})\odot\left(\frac{[V]}{[\tilde{W}H]}H^{T}\right)\right]^{.\frac{1}{2}}-\left(J_{F,N}H^{T}-4\lambda(\tilde{W}Y^{-})\right)}{\left[4\lambda\tilde{W}(Y^{+}+Y^{-})\right]}$,
where $A\odot B$ (resp. $\frac{[A]}{[B]}$) is the Hadamard product (resp. division) between $A$ and $B$, $A^{.\alpha}$ is the element-wise $\alpha$ exponent of $A$,
$J_{F,N}$ is the $F$-by-$N$ all-one matrix, and $Y=Y^{+}-Y^{-}=(\tilde{W}^{T}\tilde{W}+\delta I)^{-1}$ with $\delta>0$, $Y^{+}\geq 0$, $Y^{-}\geq 0$, $\lambda>0$.
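The update in Table II can be transcribed almost literally into NumPy. The sketch below is not the authors' implementation: the function names, the choice of the entrywise positive and negative parts of $Y$ as the split $Y=Y^{+}-Y^{-}$, and the random test data are assumptions, and the normalization of the columns of $W$ is not handled here.

```python
import numpy as np

def objective(V, W, H, lam, delta=1.0):
    """KL divergence D(V|WH) plus lam * logdet(W^T W + delta * I)."""
    WH = W @ H
    kl = np.sum(V * np.log(V / WH) - V + WH)
    _, logdet = np.linalg.slogdet(W.T @ W + delta * np.eye(W.shape[1]))
    return kl + lam * logdet

def update_W_minvol_kl(V, W, H, lam, delta=1.0):
    """One multiplicative update of W following Table II (no normalization)."""
    K = W.shape[1]
    Y = np.linalg.inv(W.T @ W + delta * np.eye(K))
    # Assumption: Y^+ and Y^- are the entrywise positive and negative parts of Y.
    Y_pos, Y_neg = np.maximum(Y, 0.0), np.maximum(-Y, 0.0)
    B = np.ones_like(V) @ H.T - 4.0 * lam * (W @ Y_neg)  # J_{F,N} H^T - 4 lam (W Y^-)
    C = W @ (Y_pos + Y_neg)                               # W (Y^+ + Y^-)
    R = (V / (W @ H)) @ H.T                               # ([V] / [W H]) H^T
    return W * (np.sqrt(B**2 + 8.0 * lam * C * R) - B) / (4.0 * lam * C)

# Hypothetical strictly positive data, only to exercise the update.
rng = np.random.default_rng(0)
V = rng.random((20, 30)) + 0.1
W = rng.random((20, 3)) + 0.1
H = rng.random((3, 30)) + 0.1
lam = 0.5
W_new = update_W_minvol_kl(V, W, H, lam)
before, after = objective(V, W, H, lam), objective(V, W_new, H, lam)
```

Comparing `before` and `after` is a quick sanity check that the transcription behaves like a majorization-minimization step, which should not increase the penalized objective.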

Table 3. TABLE III: SDR, SIR and SAR metrics comparison for results obtained with baseline KL-NMF and min-vol KL-NMF on a synthetic mix of bass and drums.

Algorithms	Source 1: bass			Source 2: drums
	SDR(dB)	SIR(dB)	SAR(dB)	SDR(dB)	SIR(dB)	SAR(dB)
min-vol KL-NMF	-1.14	0.12	7.78	9.60	19.8	10.09
baseline KL-NMF	-4.26	-1.39	2.64	7.97	9.00	15.25
sparse KL-NMF	-4.69	-1.73	2.33	7.89	8.96	14.98

Table 4. TABLE IV: Runtime performance in seconds of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18]. The table reports the average and standard deviation over 20 random initializations for three experimental setups described in the text.

Algorithms	runtime in seconds
	setup #1	setup #2	setup #3
baseline KL-NMF	0.44 $\pm$ 0.03	0.43 $\pm$ 0.01	3.81 $\pm$ 0.19
min-vol KL-NMF	3.79 $\pm$ 0.13	2.39 $\pm$ 0.30	10.19 $\pm$ 1.28
sparse KL-NMF	0.20 $\pm$ 0.02	0.20 $\pm$ 0.01	2.21 $\pm$ 0.26

Equations

$x(t)=\sum_{k=1}^{K}s^{(k)}(t)\quad\text{with } t=0,\dots,T-1.$

$X=S(x(t))=S\left(\sum_{k=1}^{K}s^{(k)}(t)\right)=\sum_{k=1}^{K}S^{(k)},$

$\min_{W\geq 0,\,H\geq 0}D(V|WH)=\sum_{f,n}d(V_{fn}|[WH]_{fn}),$

$\left\{\begin{array}{ll}\frac{1}{\beta(\beta-1)}\left(x^{\beta}+(\beta-1)y^{\beta}-\beta xy^{\beta-1}\right) & \text{for }\beta\neq 0,1,\\ x\log\frac{x}{y}-x+y & \text{for }\beta=1,\\ \frac{x}{y}-\log\frac{x}{y}-1 & \text{for }\beta=0.\end{array}\right.$

$v_{n}\in\operatorname{cone}(W)=\{v\in\mathbb{R}^{F}\mid v=W\theta,\ \theta\geq 0\},$

$\min_{W(:,j)\in\Delta^{F}\ \forall j,\ H\geq 0}$

$\min_{W\in\mathbb{R}^{F\times K},\,H\in\mathbb{R}^{K\times N}}$

$V=WH,\ W^{T}e=e,\ H\geq 0,$

$\min_{W\geq 0}$

$d_{\beta}(x|y)=\check{d}_{\beta}(x|y)+\hat{d}_{\beta}(x|y)+\bar{d}_{\beta}(x|y),$

$G(w|\tilde{w})$

$+\left[\hat{d}^{\prime}(v_{n}|\tilde{v}_{n})\sum_{k}(w_{k}-\tilde{w}_{k})h_{kn}+\hat{d}(v_{n}|\tilde{v}_{n})\right]$

$\operatorname{logdet}(B)$

$=\operatorname{trace}(A^{-1}B)+\operatorname{logdet}(A)-K.$

$\operatorname{logdet}(W^{T}W+\delta I)\leq l(W,Z),$

$\bar{l}(w|\tilde{w})=l(\tilde{w})+\Delta w^{T}\nabla l(\tilde{w})+\frac{1}{2}\Delta w^{T}\Phi(\tilde{w})\Delta w,$

$\bar{F}(W|\tilde{W})=\sum_{f}G(w_{f}|\tilde{w}_{f})+\lambda\sum_{f}\bar{l}(w_{f}|\tilde{w}_{f})+c,$

$\text{mask}_{f,n}^{(k)}=\frac{\hat{X}_{f,n}^{(k)}}{\sum_{k}\hat{X}_{f,n}^{(k)}}\quad\text{with } k=1,\dots,K,$

$e^{T}\hat{W}=e^{T}W^{\#}A^{-1}=e^{T}A^{-1}=e^{T},$

$|M_{ii}|\geq\sum_{j\neq i}|M_{ij}|\quad\text{for all } i.$

$M_{ii}$

$M_{ij}$

$M_{ii}-\sum_{j\neq i}|M_{ij}|$

$-2w_{i}\sum_{j\neq i}|Y_{ij}|w_{j}$

$=2w_{i}|Y_{ii}|w_{i}-2w_{i}Y_{ii}w_{i}\geq 0,$

$\nabla_{w_{k}}\bar{F}(w|\tilde{w})$

$+2\lambda\left[\operatorname{Diag}\left(\frac{Y^{+}\tilde{w}+Y^{-}\tilde{w}}{\tilde{w}}\right)\right]_{k}w_{k}$

$-2\lambda\left[\operatorname{Diag}\left(\frac{Y^{+}\tilde{w}+Y^{-}\tilde{w}}{\tilde{w}}\right)\right]_{k}\tilde{w}_{k}.$

$\tilde{a}=2\lambda\left[\operatorname{Diag}\left(\frac{Y^{+}\tilde{w}+Y^{-}\tilde{w}}{\tilde{w}}\right)\right]_{k},$

$\tilde{b}=\sum_{n}\frac{h_{kn}}{\tilde{v}_{n}}-4\lambda[Y^{-}\tilde{w}]_{k},$

$\tilde{d}=-\sum_{n}h_{kn}\frac{\tilde{w}_{k}^{2}v_{n}}{\tilde{v}_{n}^{2}}.$

Full text

Blind Audio Source Separation with Minimum-Volume Beta-Divergence NMF

Valentin Leplat, Nicolas Gillis, Andersen M.S. Ang

Department of Mathematics and Operational Research, Faculté Polytechnique, Université de Mons, Rue de Houdain 9, 7000 Mons, Belgium. The authors acknowledge the support of the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47, and of the European Research Council (ERC starting grant no 679515). E-mails: {valentin.leplat, nicolas.gillis, manshun.ang}@umons.ac.be. Manuscript received in July 2019. Accepted April 2020.

Abstract

Considering a mixed signal composed of various audio sources and recorded with a single microphone, we consider in this paper the blind audio source separation problem, which consists in isolating and extracting each of the sources. To perform this task, nonnegative matrix factorization (NMF) based on the Kullback-Leibler and Itakura-Saito $\beta$-divergences is a standard and state-of-the-art technique that uses the time-frequency representation of the signal. We present a new NMF model better suited for this task. It is based on the minimization of $\beta$-divergences along with a penalty term that promotes the columns of the dictionary matrix to have a small volume. Under some mild assumptions and in noiseless conditions, we prove that this model is able to identify the sources. In order to solve this problem, we propose multiplicative updates whose derivations are based on the standard majorization-minimization framework. We show on several numerical experiments that our new model is able to obtain more interpretable results than standard NMF models. Moreover, we show that it is able to recover the sources even when the number of sources present in the mixed signal is overestimated. In fact, our model automatically sets sources to zero in this situation, and hence performs model order selection automatically.

Index Terms:

nonnegative matrix factorization, $\beta$ -divergences, minimum-volume regularization, identifiability, blind audio source separation, model order selection

I Introduction

Blind audio source separation concerns the techniques used to extract unknown signals, called sources, from a mixed audio signal $x$. In this paper, we assume that the audio signal is recorded with a single microphone. Given a mixed signal composed of various audio sources, blind audio source separation consists in isolating and extracting each of the sources on the basis of the single recording. Usually, the only known information is an estimate of the number of sources present in the mixed signal. The blind source separation problem is said to be underdetermined as there are fewer sensors (only one in our case) than sources. It is therefore necessary to find additional information to make the problem well posed. The most common technique used for this kind of problem is to introduce some form of redundancy in the mixed signal in order to make it overdetermined. This is typically done by computing the spectrogram, which represents the signal in the time and frequency domains simultaneously (splitting the signal into overlapping time frames). The computation of spectrograms can be summarized as follows: short time segments are extracted from the signal and multiplied element-wise by a window function or “smoothing” window of size $F$. Successive windows overlap by a fraction of their length, which is usually taken as 50%. On each of these segments, a discrete Fourier transform is computed, and the results are stacked column-by-column in a matrix $X$. Thus, from a one-dimensional signal $x\in\mathbb{R}^{T}$, we obtain a complex matrix $X\in\mathbb{C}^{F\times N}$ called the spectrogram, where $F\times N\simeq 2T$ (due to the 50% overlap between windows). Note that the length of the window determines the shape of the spectrogram.
These preliminary operations correspond to computing the short time Fourier transform (STFT), which is given by the following formula: for $1\leq f\leq F$ and $1\leq n\leq N$ , $X_{f,n}=\sum_{j=0}^{F-1}w_{j}x_{nL+j}e^{(-i\frac{2\pi fj}{F})}$ , where $w\in\mathbb{R}^{F}$ is the smoothing window of size $F$ , $L$ is a shift parameter (also called hop size), and $H=F-L$ is the overlap parameter. The number of rows corresponds to the frequency resolution. Letting $f_{s}$ be the sampling rate of the audio signal, consecutive rows correspond to frequency bands that are $f_{s}/F$ Hz apart.
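The STFT formula above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' code: the function name `stft`, the Hann window, the 0-based frequency index $f=0,\dots,F-1$ used by `numpy.fft.fft`, and the 440 Hz test tone are all choices made here.

```python
import numpy as np

def stft(x, F, L, window=None):
    """X[f, n] = sum_j w[j] * x[n*L + j] * exp(-2i*pi*f*j/F), with f = 0..F-1."""
    if window is None:
        window = np.hanning(F)            # assumed smoothing window
    N = 1 + (len(x) - F) // L             # number of complete frames
    # Each column is one windowed segment of length F, shifted by the hop size L.
    frames = np.stack([window * x[n * L : n * L + F] for n in range(N)], axis=1)
    return np.fft.fft(frames, axis=0)     # DFT of every column

fs = 8000                                 # sampling rate in Hz (hypothetical)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440.0 * t)         # one second of a 440 Hz tone
F, L = 256, 128                           # window size and hop size (50% overlap)
X = stft(x, F, L)
V = np.abs(X)                             # amplitude spectrogram
peak_bin = int(np.argmax(V[: F // 2].mean(axis=1)))
```

With these values, consecutive rows are $f_{s}/F=31.25$ Hz apart, so the tone is expected to concentrate around row $440/31.25\approx 14$, and $F\times N=256\times 61$ is close to $2T=16000$.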

The time-frequency representation of a signal highlights two of its fundamental properties: sparsity and redundancy. Sparsity comes from the fact that most real signals are not active at all frequencies at all time points. Redundancy comes from the fact that frequency patterns of the sources repeat over time. Mathematically, this means that the spectrogram is a low-rank matrix. These two fundamental properties led sound source separation techniques to integrate algorithms such as nonnegative matrix factorization (NMF). Such techniques retrieve sensible solutions even for single-channel signals.

I-A Mixing assumptions

Given $K$ source signals $s^{(k)}\in\mathbb{R}^{T}$ for $1\leq k\leq K$ , we assume the acquisition process is well modelled by a linear instantaneous mixing model:

$x(t)=\sum_{k=1}^{K}s^{(k)}(t)\quad\text{with } t=0,\dots,T-1. \qquad (1)$

Therefore, for each time index $t$, the mixed signal $x(t)$ from a single microphone is the sum of the $K$ source signals. It is standard to assume that microphones are linear as long as the recorded signals are not too loud. If signals are too loud, they are usually clipped. The mixing process is modelled as instantaneous, as opposed to convolutive mixing, which is used to take into account sound effects such as reverberation. The source separation problem consists in finding source estimates $\hat{s}^{(k)}$ of the sources $s^{(k)}$ for all $k\in\{1,\dots,K\}$. Let us denote by $S$ the linear STFT operator, and let $S^{\dagger}$ be its conjugate transpose. We have $S^{\dagger}S=FI$, where $I$ is the identity matrix of appropriate dimension. For the remainder of this paper, $S^{\dagger}$ stands for the inverse short time Fourier transform. Note that the term inverse is not meant in a mathematical sense. Indeed, the STFT is not a surjective transformation from $\mathbb{R}^{T}$ to $\mathbb{C}^{F\times N}$. In other words, not every matrix with complex entries is the STFT of a real signal; see [1] and [2] for more details. By applying the STFT operator $S$ to (1), we obtain the mixing model in the time-frequency domain:

$X=S(x(t))=S\left(\sum_{k=1}^{K}s^{(k)}(t)\right)=\sum_{k=1}^{K}S^{(k)},$

where $S^{(k)}$ is the STFT of source $k$, that is, the spectrogram of source $k$. To identify the sources, we use in this paper the amplitude spectrogram $V=|X|\in\mathbb{R}_{+}^{F\times N}$ defined as $V_{fn}=\left|X_{fn}\right|$ for all $f$, $n$. We assume that $V=\sum_{k=1}^{K}\left|S^{(k)}\right|$, which means that there is no sound cancellation between the sources; this is usually the case. Finally, we assume that the source spectrograms $\left|S^{(k)}\right|$ are well approximated by nonnegative rank-one matrices. This leads to the NMF model described in the next section. Note that a source can be made of several rank-one factors, in which case a post-processing step has to recombine them a posteriori (e.g., by looking at the correspondence in the activation of the sources over time). Note also that we focus on the NMF stage of the source separation, which factorizes $V$ into the source spectrograms. For the phase reconstruction, which is a highly non-trivial problem, we consider a naive reconstruction procedure consisting in keeping the same phase as the input mixture for each source [1].
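The naive phase reconstruction described above can be illustrated as follows. This is a minimal sketch that only follows the "keep the phase of the mixture" description: the function name `source_estimates`, the use of rank-one factors $W(:,k)H(k,:)$ as amplitude estimates, and the random data are assumptions, and no masking or inverse STFT step is included.

```python
import numpy as np

def source_estimates(X, W, H):
    """Complex spectrogram estimate of each source: the rank-one amplitude
    estimate W[:, k] H[k, :] combined with the phase of the mixture X."""
    phase = np.exp(1j * np.angle(X))      # unit-modulus phase of the mixture
    return [np.outer(W[:, k], H[k, :]) * phase for k in range(W.shape[1])]

# Hypothetical data, only to exercise the function.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5)) + 1j * rng.standard_normal((8, 5))
W = rng.random((8, 2)) + 0.1
H = rng.random((2, 5)) + 0.1
S_hat = source_estimates(X, W, H)
```

By construction, the amplitudes of the estimates sum to $WH$ and every estimate shares the phase of $X$.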

I-B NMF for audio source separation

Given a non-negative matrix $V\in\mathbb{R}_{+}^{F\times N}$ (the spectrogram) and a positive integer $K\ll\min(F,N)$ (the number of sources, called the factorization rank), NMF aims to compute two non-negative matrices $W$ with $K$ columns and $H$ with $K$ rows such that $V\approx WH$. NMF approximates each column of $V$ by a linear combination of the columns of $W$ weighted by the components of the corresponding column of $H$ [3]. When the matrix $V$ corresponds to the amplitude spectrogram or the power spectrogram of an audio signal, we have that

$\bullet$ $W$ is referred to as the dictionary matrix, and each of its columns corresponds to the spectral content of a source, and

$\bullet$ $H$ is the activation matrix, specifying whether a source is active at a certain time frame and with which intensity.

In other words, each rank-one factor $W(:,k)H(k,:)$ corresponds to a source: the $k$th column $W(:,k)$ of $W$ is the spectral content of source $k$, and the $k$th row $H(k,:)$ of $H$ is its activation over time. To compute $W$ and $H$, NMF requires solving the following optimization problem

$\min_{W\geq 0,\,H\geq 0}D(V|WH)=\sum_{f,n}d(V_{fn}|[WH]_{fn}),$

where $A\geq 0$ means that $A$ is component-wise nonnegative, and $d(x|y)$ is an appropriate measure of fit. In audio source separation, a common measure of fit is the discrete $\beta$ -divergence denoted $d_{\beta}(x|y)$ and equal to

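$\left\{\begin{array}{ll}\frac{1}{\beta(\beta-1)}\left(x^{\beta}+(\beta-1)y^{\beta}-\beta xy^{\beta-1}\right) & \text{for }\beta\neq 0,1,\\ x\log\frac{x}{y}-x+y & \text{for }\beta=1,\\ \frac{x}{y}-\log\frac{x}{y}-1 & \text{for }\beta=0.\end{array}\right.$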

For $\beta=2$, this is, up to a factor $1/2$, the standard squared Euclidean distance, that is, the squared Frobenius norm $||V-WH||_{F}^{2}$. For $\beta=1$ and $\beta=0$, the $\beta$-divergence corresponds to the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence, respectively. The error measure should be chosen according to the noise statistics assumed on the data. The Frobenius norm assumes i.i.d. Gaussian noise, the KL divergence assumes additive Poisson noise, and the IS divergence assumes multiplicative Gamma noise [4]. The $\beta$-divergence $d_{\beta}(x|y)$ is homogeneous of degree $\beta$: $d_{\beta}(\lambda x|\lambda y)=\lambda^{\beta}d_{\beta}(x|y)$. This implies that factorizations obtained with $\beta>0$ (such as the Euclidean distance or the KL divergence) rely more heavily on the largest data values, and less precision is to be expected in the estimation of the low-power components. The IS divergence ($\beta=0$) is scale-invariant, that is, $d_{IS}(\lambda x|\lambda y)=d_{IS}(x|y)$ [5]. The IS divergence is the only one in the family of $\beta$-divergences to possess this property. It implies that time-frequency areas of low power are as important in the divergence computation as the areas of high power. This property is interesting in audio source separation as low-power frequency bands can perceptually contribute as much as high-power frequency bands. Note that both the KL and IS divergences are better adapted to audio source separation than the Euclidean distance as they are built on a logarithmic scale, as is human perception; see [1] and [5]. Moreover, the $\beta$-divergence is only convex with respect to $W$ (or $H$) if $\beta\geq 1$. Otherwise, the objective function is non-convex. This implies that, for $\beta<1$, even the problem of inferring $H$ with $W$ fixed is non-convex. For more details on $\beta$-divergences, see [5].
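The three cases of the $\beta$-divergence and its homogeneity of degree $\beta$ can be checked numerically. The sketch below assumes strictly positive inputs; the function name `beta_divergence` and the test values are illustrative choices.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Entrywise beta-divergence d_beta(x|y) for strictly positive x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 0:                         # Itakura-Saito divergence
        r = x / y
        return r - np.log(r) - 1.0
    if beta == 1:                         # Kullback-Leibler divergence
        return x * np.log(x / y) - x + y
    return (x**beta + (beta - 1.0) * y**beta
            - beta * x * y**(beta - 1.0)) / (beta * (beta - 1.0))

x0, y0, scale = 2.0, 5.0, 3.0
# Ratio d(scale*x | scale*y) / d(x | y); homogeneity predicts scale**beta.
ratios = {b: float(beta_divergence(scale * x0, scale * y0, b)
                   / beta_divergence(x0, y0, b))
          for b in (0, 0.5, 1, 2)}
```

For $\beta=0$ the ratio is 1, which is the scale invariance of the IS divergence, whereas for $\beta=2$ scaling the data by 3 multiplies the divergence by 9.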

I-C Contribution and outline of the paper

In Section II, we propose a new NMF model, referred to as minimum-volume $\beta$-NMF (min-vol $\beta$-NMF), to tackle the audio source separation problem. This model penalizes the columns of the dictionary matrix $W$ so that their convex hull has a small volume. To the best of our knowledge, this model is novel in two aspects: (1) it is the first time a minimum-volume penalty is associated with a $\beta$-divergence for $\beta\neq 2$ and it is the first time such models are used in the context of audio source separation, and (2) as opposed to most previously proposed minimum-volume NMF models, our model imposes a normalization constraint on the factor $W$ instead of $H$. As far as we know, the only other paper that used a normalization of $W$ is [6], but the authors did not justify this choice compared to the normalization of $H$ (the choice seems arbitrary, motivated by the ‘elimination of the norm indeterminacy’), nor did they provide theoretical guarantees. In this paper, we explain why normalization of $W$ is a better choice in practice, and we prove that, under some mild assumptions and in the noiseless case, this model provably identifies the sources; see Theorem 1. To the best of our knowledge, this is the first result of this type in the audio source separation literature. In Section III, we propose an algorithm to tackle min-vol $\beta$-NMF, focusing on the KL and IS divergences. The algorithm is based on multiplicative updates (MU) that are derived using the standard majorization-minimization framework, and that monotonically decrease the objective function. In Section IV, we present several numerical experiments, comparing min-vol $\beta$-NMF with standard NMF and sparse NMF. The two main conclusions are that (1) min-vol $\beta$-NMF performs consistently better at identifying the sources, and (2) as opposed to NMF and sparse NMF, min-vol $\beta$-NMF is able to detect when the factorization rank is overestimated by automatically setting sources to zero.

II Minimum-volume NMF with $\beta$ -divergences

In this section, we present a new model of separation based on the minimization of $\beta$ -divergences including a penalty term promoting solutions with minimum volume spanned by the columns of the dictionary matrix $W$ . Section II-A recalls the geometric interpretation of NMF which motivated the use of a minimum volume penalty on the dictionary $W$ . Section II-B discusses the new proposed normalization compared to previous minimum volume NMF models, and proves that min-vol $\beta$ -NMF provably recovers the true factors $(W,H)$ under mild conditions and in the noiseless case; see Theorem 1.

II-A Geometry and the min-vol $\beta$ -NMF model

As mentioned earlier, $V=WH$ means that each column of $V$ is a linear combination of the columns of $W$ weighted by the components of the corresponding column of $H$ ; in fact, $v_{n}=Wh_{n}$ for $n=1,...,N$ , where $v_{n}$ denotes the $n$ th column of data matrix $V$ . This gives to NMF a nice geometric interpretation: for all $n$

$v_{n}\in\operatorname{cone}(W)=\{v\in\mathbb{R}^{F}\mid v=W\theta,\ \theta\geq 0\},$

meaning that the columns of $V$ are contained in the convex cone generated by the columns of $W$ ; see Figure 1 for an illustration.

From this interpretation, it follows that, in general, NMF decompositions are not unique because there exist several (often, infinitely many) sets of columns of $W$ that span the convex cone generated by the data points; see for example [8] for more details. Hence, NMF is in most cases ill-posed because the optimal solution is not unique. In order to make the solution unique (up to permutation and scaling of the columns of $W$ and the rows of $H$), hence making the problem well-posed and the parameters $(W,H)$ of the problem identifiable, a key idea is to look for a solution $W$ with minimum volume. Intuitively, we look for the cone $\operatorname{cone}(W)$ that contains the data points and is as close as possible to them. The use of minimum-volume NMF has led to a new class of NMF methods that outperform existing ones in many applications such as document analysis and blind hyperspectral unmixing; see the recent survey [9]. Note that minimum-volume NMF implicitly promotes sparsity of the factor $H$: the fact that $W$ has a small volume implies that many data points will be located on the facets of $\operatorname{cone}(W)$, hence $H$ will be sparse.

Hence, in this paper, we consider the following model, referred to as min-vol $\beta$ -NMF:

$\min_{W(:,j)\in\Delta^{F}\ \forall j,\ H\geq 0}\ D_{\beta}(V|WH)+\lambda\,\text{vol}(W), \qquad (2)$

where $\Delta^{F}=\left\{x\in\mathbb{R}^{F}_{+}\,\big|\,\sum_{i=1}^{F}x_{i}=1\right\}$ is the unit simplex, $\lambda$ is a penalty parameter and $\text{vol}(W)$ is a function that measures the volume spanned by the columns of $W$. In this paper, we use $\text{vol}(W)=\operatorname{logdet}(W^{T}W+\delta I)$, where $\delta$ is a small positive constant that prevents $\operatorname{logdet}(W^{T}W)$ from going to $-\infty$ when $W$ tends to a rank-deficient matrix (that is, when $r=\text{rank}(W)<K$). The reason for using such a measure is that $\sqrt{\text{det}\left(W^{T}W\right)}/K!$ is the volume of the convex hull of the columns of $W$ and the origin. This measure is one of the most widely used ones, and has been shown to perform very well in practice [10, 11]. Moreover, the criterion $\operatorname{logdet}(W^{T}W+\delta I)$ is able to distinguish two rank-deficient solutions and favour solutions for $W$ with smaller volume [12]. Finally, as we will illustrate in Section IV, this criterion is able to identify the right number of sources even when $K$ is overestimated, by putting some rank-one factors to zero.
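The volume criterion is straightforward to evaluate numerically. The following sketch computes $\operatorname{logdet}(W^{T}W+\delta I)$ with `numpy.linalg.slogdet`; the function name `min_vol_penalty` and the three small matrices are illustrative assumptions, and the matrices are not normalized to the unit simplex.

```python
import numpy as np

def min_vol_penalty(W, delta=1.0):
    """logdet(W^T W + delta * I), finite even when W is rank deficient."""
    K = W.shape[1]
    _, logdet = np.linalg.slogdet(W.T @ W + delta * np.eye(K))
    return logdet

W_full = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # rank 2
W_dup = np.array([[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])    # rank 1, repeated column
W_zero = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]])   # rank 1, zero column

penalties = [min_vol_penalty(W) for W in (W_full, W_dup, W_zero)]
```

With $\delta=1$, the three values are $\log 4$, $\log 3$ and $\log 2$: both rank-deficient matrices get a finite penalty, and the criterion still ranks them, which is the behaviour described above.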

II-B Normalization and identifiability of min-vol $\beta$ -NMF

As mentioned above, under some appropriate conditions on $V=WH$ , minimum-volume NMF models will provably recover the ground-truth $(W,H)$ that generated $V$ , up to permutation and scaling of the rank-one factors. The first identifiability results for minimum-volume NMF models assumed that the entries in each column of $H$ sum to one, that is, that $H^{T}e=e$ where $e$ is the all-one column vector whose dimension is clear from the context, meaning that $H$ is column stochastic [8, 13]. Under this condition, each column of $V$ lies in the convex hull of the columns of $W$ ; see Figure 2 for an illustration.

Under the three assumptions that (1) $H$ is column stochastic, (2) $W$ is full column rank, and (3) $H$ satisfies the sufficiently scattered condition, minimizing the volume of $\operatorname{conv}(W)$ such that $V=WH$ recovers the true underlying factors, up to permutation and scaling. Intuitively, the sufficiently scattered condition requires $H$ to be sparse enough so that data points are located on the facets of $\operatorname{conv}(W)$; see Appendix A for a formal definition. The sufficiently scattered condition makes sense for most audio source data sets as it is reasonable to assume that, for most time points, only a few sources are active, hence $H$ is sparse; see [9] for more details on the sufficiently scattered condition. Note that the sufficiently scattered condition is a generalization of the separability condition, which requires $W=V\left(:,\mathcal{J}\right)$ for some index set $\mathcal{J}$ of size $K$ [14]. However, separability is a much stronger assumption as it requires that, for each source, there exists a time point where only that source is active. Note that although min-vol NMF guarantees identifiability, the corresponding optimization problem (2) is still hard to solve in general, as is the original NMF problem [15].

Despite this nice result, the constraint $H^{T}e=e$ makes the NMF model less general and does not apply to all data sets. In the case where the data does not naturally belong to a convex hull, one needs to normalize the data points so that their entries sum to one, so that $H^{T}e=e$ can be assumed without loss of generality (in the noiseless case). This normalization can sometimes increase the noise and might greatly influence the solution, hence it is usually not recommended in practice; see the discussion in [9].

In [7], the authors show that identifiability still holds when the condition that $H$ is column stochastic is relaxed to $H$ being row stochastic. As opposed to column stochasticity, row stochasticity of $H$ can be assumed without loss of generality since any factorization $WH$ can be properly normalized so that this assumption holds. In fact, $WH=\sum_{k=1}^{K}(a_{k}W(:,k))(H(k,:)/a_{k})$ for any $a_{k}>0$ for $k=1,\dots,K$. In other terms, letting $A$ be the diagonal matrix with $A(k,k)=a_{k}=\sum_{j=1}^{N}H(k,j)$ for $k=1,\dots,K$, we have $WH=(WA)(A^{-1}H)=W^{\prime}H^{\prime}$ where $H^{\prime}=A^{-1}H$ is row stochastic.
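The rescaling argument can be verified directly. The sketch below assumes that no row of $H$ and no column of $W$ is identically zero; the function names and random data are illustrative. The second function applies the same idea with the column sums of $W$, which gives the normalization $W^{T}e=e$ used in model (2).

```python
import numpy as np

def normalize_rows_of_H(W, H):
    """Return (W A, A^{-1} H) with A = diag(row sums of H), so that He = e."""
    a = H.sum(axis=1)                     # a_k = sum_j H(k, j)
    return W * a, H / a[:, None]

def normalize_columns_of_W(W, H):
    """Return (W B^{-1}, B H) with B = diag(column sums of W), so that W^T e = e."""
    b = W.sum(axis=0)
    return W / b, H * b[:, None]

rng = np.random.default_rng(0)
W = rng.random((6, 3))
H = rng.random((3, 8))
W1, H1 = normalize_rows_of_H(W, H)
W2, H2 = normalize_columns_of_W(W, H)
```

In both cases the product $WH$ is unchanged; only the scaling of the rank-one factors moves between $W$ and $H$.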

As in [7], we prove in this paper that requiring $W$ to be column stochastic (which can also be assumed without loss of generality) also leads to identifiability. Geometrically, the columns of $W$ are constrained to be on the unit simplex. Minimizing the volume still makes a lot of sense: we want the columns of $W$ to be as close as possible to one another within the unit simplex. In Appendix A, we prove the following theorem.

Theorem 1.

Assume $V=W^{\#}H^{\#}$ with $\text{rank}(V)=K$, $W^{\#}\geq 0$, and that $H^{\#}$ satisfies the sufficiently scattered condition (Definition 2 in Appendix A). Then the optimal solution of

[TABLE]

recovers $(W^{\#},H^{\#})$ up to permutation and scaling.

Proof.

See Appendix A. ∎

In noiseless conditions, replacing $W^{T}e=e$ with $He=e$ in (3) leads to the same identifiability result; see [7, Theorem 1]. Therefore, in noiseless conditions and under the conditions of Theorem 1, both models return the same solution up to permutation and scaling. However, in the presence of noise, we have observed that the two models may behave very differently. In fact, we advocate that the constraint $W^{T}e=e$ is better suited for noisy real-world problems, which we have observed on many numerical examples. In fact, we have observed that the normalization $W^{T}e=e$ is much less sensitive to noise and returns much better solutions. The reason is mostly twofold:

(i) As described above, using the normalization $He=e$ amounts to multiplying $W$ by a diagonal matrix whose entries are the $\ell_{1}$ norms of the rows of $H$. Therefore, the columns of $W$ that correspond to dominating (resp. dominated) sources, that is, sources with much more (resp. less) power and/or active at many (resp. few) time points, will have much higher (resp. lower) norm. Therefore, the term $\operatorname{logdet}(W^{T}W+\delta I)$ is much more influenced by the dominating sources and will have difficulty penalizing the dominated sources. In other terms, the use of the term $\operatorname{logdet}(W^{T}W+\delta I)$ with the normalization $He=e$ implicitly requires that the rank-one factors $W(:,k)H(k,:)$ for $k=1,\dots,K$ are well balanced, that is, have similar norms. This is not the case for many real (audio) signals.

(ii) As will be explained in Section III, the update of $W$ requires the computation of the matrix $Y$, which is the inverse of ${W}^{T}{W}+\delta I$ (this term appears in the gradient of the objective function with respect to $W$). The numerical stability of such operations is related to the condition number of $W^{T}W+\delta I$. For an $\ell_{1}$ normalization of the columns of $W$, the condition number is bounded above as follows: $\operatorname{cond}(W^{T}W+\delta I)=\frac{\sigma_{\max}(W^{T}W+\delta I)}{\sigma_{\min}(W^{T}W+\delta I)}=\frac{\sigma_{\max}(W)^{2}+\delta}{\sigma_{\min}(W)^{2}+\delta}\leq\frac{\left(\sqrt{K}\max_{k}||W(:,k)||_{2}\right)^{2}+\delta}{\delta}\leq 1+\frac{K}{\delta}$, where $\sigma_{\min}(W)$ and $\sigma_{\max}(W)$ are the smallest and largest singular values of $W$, respectively. In the numerical experiments, we use $\delta=1$. On the other hand, the normalization $He=e$ may lead to arbitrarily large values for the condition number of $W^{T}W+\delta I$, which we have observed numerically on several examples. This issue can be mitigated with the use of the normalization $He=\rho e$ for some $\rho>0$ sufficiently large, for which identifiability still holds [7]. However, it still performs worse because of the first reason explained above.
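The bound $\operatorname{cond}(W^{T}W+\delta I)\leq 1+K/\delta$ for column-stochastic $W$ can be checked numerically. The sketch below draws random nonnegative matrices and normalizes their columns to sum to one; the function name, dimensions and values of $\delta$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def cond_and_bound(F, K, delta):
    """Condition number of W^T W + delta*I for a random column-stochastic W,
    together with the upper bound 1 + K/delta."""
    W = rng.random((F, K))
    W /= W.sum(axis=0)                    # enforce W^T e = e
    G = W.T @ W + delta * np.eye(K)
    return np.linalg.cond(G), 1.0 + K / delta

results = [cond_and_bound(F, K, delta)
           for F, K, delta in [(50, 3, 1.0), (200, 10, 1.0), (200, 10, 0.01)]]
```

The bound holds because each column has $\ell_{1}$ norm one, hence $\ell_{2}$ norm at most one, so $\sigma_{\max}(W)^{2}\leq K$.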

For these reasons, we believe that the model (3) would also be better suited (compared to the normalization on $H$ ) in other contexts; for example for document classification [16].

III Algorithm for min-vol $\beta$ -NMF

Most NMF algorithms alternately update $H$ for $W$ fixed and vice versa, and we adopt this strategy in this paper. For $W$ fixed, (2) is equivalent to standard NMF and we use the MU that have already been derived in the literature [3, 5].
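For reference, the classical multiplicative update of $H$ for the KL divergence ($\beta=1$) with $W$ fixed can be sketched as follows. This is the well-known update $H\leftarrow H\odot\frac{[W^{T}([V]/[WH])]}{[W^{T}J_{F,N}]}$ from the NMF literature, not code taken from the paper; the function names and random data are assumptions.

```python
import numpy as np

def kl_divergence(V, WH):
    """D(V|WH) for the KL divergence, assuming strictly positive entries."""
    return np.sum(V * np.log(V / WH) - V + WH)

def update_H_kl(V, W, H):
    """One multiplicative update of H for KL-NMF with W fixed."""
    numer = W.T @ (V / (W @ H))           # W^T ([V] / [W H])
    denom = W.sum(axis=0)[:, None]        # W^T J_{F,N}: column sums of W
    return H * numer / denom

rng = np.random.default_rng(0)
V = rng.random((20, 30)) + 0.1
W = rng.random((20, 3)) + 0.1
H = rng.random((3, 30)) + 0.1

errors = [kl_divergence(V, W @ H)]
for _ in range(10):
    H = update_H_kl(V, W, H)
    errors.append(kl_divergence(V, W @ H))
```

The list `errors` should be non-increasing, since this update also comes from a majorization-minimization argument.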

To tackle (2) for $H$ fixed, let us consider

[TABLE]

Note that, for now, we have discarded the normalization on the columns of $W$ . In our algorithm, we will use the update for $W$ obtained by solving (4) as a descent direction along with a line search procedure to integrate the constraint on $W$ . This will ensure that the objective function $F$ is non-increasing at each iteration. In the following sections we derive MU for $W$ that decrease the objective in (4). We follow the standard majorization-minimization framework [17]. First, an auxiliary function, which we denote $\bar{F}$ , is constructed so that it majorizes the objective. An auxiliary function for $F$ at point $\tilde{W}$ is defined as follows.

Definition 1.

The function $\bar{F}(W|\tilde{W}):\Omega\times\Omega\rightarrow\mathbb{R}$ is an auxiliary function for $F\left(W\right):\Omega\rightarrow\mathbb{R}$ at $\tilde{W}\in\Omega$ if the conditions $\bar{F}(W|\tilde{W})\geq F\left(W\right)$ for all $W\in\Omega$ and $\bar{F}(\tilde{W}|\tilde{W})=F(\tilde{W})$ are satisfied.

Then, the optimization of $F$ can be replaced by an iterative process that minimizes $\bar{F}$ . More precisely, the new iterate $W^{(i+1)}$ is computed by minimizing exactly the auxiliary function at the previous iterate $W^{(i)}$ . This guarantees that $F$ does not increase at any iteration.

Lemma 1.

Let $W,W^{(i)}\geq 0$ , and let $\bar{F}$ be an auxiliary function for $F$ at $W^{(i)}$ . Then $F$ is non-increasing under the update $W^{(i+1)}=\underset{W\geq 0}{\text{argmin}}\bar{F}(W|W^{(i)})$ .

Proof.

By definition, we have $F(W^{(i)})=\bar{F}(W^{(i)}|W^{(i)})\geq\underset{W\geq 0}{\text{min}}\bar{F}(W|W^{(i)})=\bar{F}(W^{(i+1)}|W^{(i)})\geq F(W^{(i+1)})$ . ∎
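To make Definition 1 and Lemma 1 concrete, here is a small scalar sketch in Python. The objective $f(w)=(w-2)^{2}+\log(1+w)$ is a toy function chosen for this illustration only (it does not appear in the paper); its concave part $\log(1+w)$ is majorized by its tangent at $\tilde{w}$, which yields an auxiliary function whose minimizer over $w\geq 0$ has a closed form.

```python
import math

def f(w):
    """Toy objective: convex part (w - 2)^2 plus concave part log(1 + w)."""
    return (w - 2.0) ** 2 + math.log(1.0 + w)

def f_bar(w, w_tilde):
    """Auxiliary function: the concave part is replaced by its tangent at w_tilde."""
    tangent = math.log(1.0 + w_tilde) + (w - w_tilde) / (1.0 + w_tilde)
    return (w - 2.0) ** 2 + tangent

def mm_step(w_tilde):
    """Exact minimizer of f_bar(. | w_tilde) over w >= 0."""
    return max(0.0, 2.0 - 0.5 / (1.0 + w_tilde))

w = 5.0
values = [f(w)]
for _ in range(20):
    w = mm_step(w)
    values.append(f(w))

# The sequence of objective values never increases, as Lemma 1 predicts.
print(all(b <= a + 1e-12 for a, b in zip(values, values[1:])))
```

The same pattern (tangent for the concave part, exact minimization of the surrogate) is what the following sections apply to the $\beta$-divergence and to the logdet penalty.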

The most difficult part in using the majorization-minimization framework is to design an auxiliary function that is easy to optimize. Usually such auxiliary functions are separable (that is, there is no interaction between the variables so that each entry of $W$ can be updated independently) and convex.

III-A Separable auxiliary functions for $\beta$ -divergences

For the sake of completeness, we briefly recall the auxiliary function proposed in [5] for the data fitting term. It consists in majorizing the convex part of the $\beta$ -divergence using Jensen’s inequality and majorizing the concave part by its tangent (first-order Taylor approximation). We have

$$d_{\beta}(x|y)=\check{d}(x|y)+\hat{d}(x|y)+\bar{d}(x),\qquad(5)$$

where $\check{d}$ is a convex function of $y$ , $\hat{d}$ is a concave function of $y$ and $\bar{d}$ is constant with respect to $y$ ; see Table I.

The function $D_{\beta}(V|WH)$ can be written as $\sum_{f}D_{\beta}(v_{f}|w_{f}H)$ where $v_{f}$ and $w_{f}$ are respectively the $f$ th row of $V$ and $W$ . Therefore we only consider the optimization over one specific row of $W$ . To simplify notation, we denote iterates $w^{(i+1)}$ (next iterate) and $w^{(i)}$ (current iterate) as $w$ and $\tilde{w}$ , respectively.

Lemma 2 ([5]).

Let $\tilde{v}=\tilde{w}H$ and $\tilde{w}$ be such that $\tilde{v_{n}}>0$ for all $n$ and $\tilde{w_{k}}>0$ for all $k$ . Then the function

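$$G(w|\tilde{w})=\sum_{n}\left[\sum_{k}\frac{\tilde{w}_{k}h_{kn}}{\tilde{v}_{n}}\check{d}\left(v_{n}\Big|\tilde{v}_{n}\frac{w_{k}}{\tilde{w}_{k}}\right)\right]+\sum_{n}\left[\hat{d}^{\prime}(v_{n}|\tilde{v}_{n})\sum_{k}(w_{k}-\tilde{w}_{k})h_{kn}+\hat{d}(v_{n}|\tilde{v}_{n})\right]+\sum_{n}\bar{d}(v_{n})\qquad(6)$$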

is an auxiliary function for $\sum_{n}d(v_{n}|\left[wH\right]_{n})$ at $\tilde{w}$ .
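The two properties of Definition 1 can be checked numerically for the KL case ($\beta=1$). The Python sketch below assumes the construction recalled at the start of this section: Jensen's inequality applied to the convex part $\check{d}(x|y)=-x\log y$ with weights $\tilde{w}_{k}h_{kn}/\tilde{v}_{n}$, and the tangent of the concave part $\hat{d}(x|y)=y$, which is linear and therefore reproduced exactly. The random data, seed and function names are hypothetical.

```python
import math
import random

def kl(x, y):
    """KL divergence d_1(x|y) = x log(x/y) - x + y for positive scalars."""
    return x * math.log(x / y) - x + y

def row_times_H(w, H):
    """Product of a row vector w (length K) with H (K x N, list of rows)."""
    return [sum(w[k] * H[k][n] for k in range(len(w))) for n in range(len(H[0]))]

def kl_fit(v, w, H):
    """Data fitting term sum_n d_1(v_n | [wH]_n)."""
    return sum(kl(vn, yn) for vn, yn in zip(v, row_times_H(w, H)))

def kl_auxiliary(v, w, w_tilde, H):
    """Separable majorizer of kl_fit at w_tilde (assumed form, beta = 1)."""
    v_tilde = row_times_H(w_tilde, H)
    total = 0.0
    for n, vn in enumerate(v):
        for k in range(len(w)):
            weight = w_tilde[k] * H[k][n] / v_tilde[n]  # weights sum to one over k
            total += weight * (-vn * math.log(v_tilde[n] * w[k] / w_tilde[k]))
        total += sum(w[k] * H[k][n] for k in range(len(w)))  # concave (linear) part
        total += vn * (math.log(vn) - 1.0)                   # constant part
    return total

random.seed(1)
K, N = 3, 6
H = [[random.uniform(0.1, 1.0) for _ in range(N)] for _ in range(K)]
v = [random.uniform(0.1, 2.0) for _ in range(N)]
w_tilde = [random.uniform(0.1, 1.0) for _ in range(K)]
w = [random.uniform(0.1, 1.0) for _ in range(K)]

gap_at_anchor = kl_auxiliary(v, w_tilde, w_tilde, H) - kl_fit(v, w_tilde, H)
gap_elsewhere = kl_auxiliary(v, w, w_tilde, H) - kl_fit(v, w, H)
print(gap_at_anchor, gap_elsewhere)
```

The first gap should vanish up to rounding (tightness at $\tilde{w}$) and the second should be nonnegative (majorization).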

III-B A separable auxiliary function for the minimum-volume regularizer

The minimum-volume regularizer $\operatorname{logdet}(W^{T}W+\delta I)$ is a non-convex function. However, it can be upper-bounded using the fact that $\operatorname{logdet}(.)$ is a concave function so that its first-order Taylor approximation provides an upper bound; see for example [10]. For any positive-definite matrices $A$ and $B\in\mathbb{R}^{K\times K}$ , we have:

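$$\operatorname{logdet}(B)\leq\operatorname{logdet}(A)+\operatorname{trace}\left(A^{-1}(B-A)\right)=\operatorname{trace}\left(A^{-1}B\right)+\operatorname{logdet}(A)-K.$$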

This implies that for any $W,Z\in\mathbb{R}^{F\times K}$ , we have

$$\operatorname{logdet}(W^{T}W+\delta I)\leq l(W,Z),\qquad(7)$$

where $l(W,Z)=\operatorname{trace}\left(Y(W^{T}W+\delta I)\right)+\operatorname{logdet}\left(Y^{-1}\right)-K$ and $Y=(Z^{T}Z+\delta I)^{-1}$ with $\delta>0$ . Note that $Z^{T}Z+\delta I$ is positive definite, hence invertible, and its inverse $Y$ is also positive definite. Moreover, $l(Z,Z)=\operatorname{logdet}(Z^{T}Z+\delta I)$ since $\operatorname{trace}(YY^{-1})=K$ , so $l(W,Z)$ is an auxiliary function for $\operatorname{logdet}(W^{T}W+\delta I)$ at $Z$ . However, it is quadratic and not separable, hence non-trivial to optimize over the nonnegative orthant. The non-constant part of $l(W,Z)$ can be written as $\sum_{f}w_{f}Yw_{f}^{T}$ where $w_{f}$ is the $f$ th row of $W$ . Henceforth we focus on one particular row of $W$ , written as a column vector $w$ of size $K\times 1$ , and on the function $l\left(w\right)=w^{T}Yw$ .
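The tangent bound on the logdet term can be verified numerically. The Python sketch below evaluates the first-order upper bound of $\operatorname{logdet}$ at $A=Z^{T}Z+\delta I$, that is $\operatorname{trace}(A^{-1}B)+\operatorname{logdet}(A)-K$ with $B=W^{T}W+\delta I$, and compares it with $\operatorname{logdet}(B)$. Note that $\operatorname{trace}(A^{-1}B)$ contains the constant $\delta\operatorname{trace}(Y)$ in addition to $\operatorname{trace}(YW^{T}W)$; this constant does not affect the minimizer over $W$. The restriction to $K=2$, the random matrices and the helper names are assumptions made for the example.

```python
import math
import random

def gram_plus_delta(W, delta):
    """Return W^T W + delta * I for W given as a list of rows with two columns."""
    G = [[sum(row[i] * row[j] for row in W) for j in range(2)] for i in range(2)]
    G[0][0] += delta
    G[1][1] += delta
    return G

def logdet_2x2(G):
    return math.log(G[0][0] * G[1][1] - G[0][1] * G[1][0])

def inverse_2x2(G):
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return [[G[1][1] / det, -G[0][1] / det], [-G[1][0] / det, G[0][0] / det]]

def trace_of_product(A, B):
    return sum(A[i][j] * B[j][i] for i in range(2) for j in range(2))

def tangent_bound(W, Z, delta):
    """First-order upper bound of logdet(W^T W + delta I) built at Z (K = 2)."""
    A = gram_plus_delta(Z, delta)
    B = gram_plus_delta(W, delta)
    Y = inverse_2x2(A)
    return trace_of_product(Y, B) + logdet_2x2(A) - 2

random.seed(3)
F, delta = 20, 1.0
Z = [[random.random() for _ in range(2)] for _ in range(F)]
W = [[random.random() for _ in range(2)] for _ in range(F)]

exact = logdet_2x2(gram_plus_delta(W, delta))
upper = tangent_bound(W, Z, delta)
tight = tangent_bound(Z, Z, delta) - logdet_2x2(gram_plus_delta(Z, delta))
print(exact, upper, tight)
```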

Lemma 3.

Let $w,\tilde{w}\in\mathbb{R}^{K}_{+}$ be such that $\tilde{w_{k}}>0$ for all $k$ , $Y=Y^{+}-Y^{-}$ with $Y^{+}=\max\left(Y,0\right)$ and $Y^{-}=\max\left(-Y,0\right)$ , and $\Phi\left(\tilde{w}\right)$ be the diagonal matrix $\Phi\left(\tilde{w}\right)=\operatorname{Diag}\left(2\frac{[Y^{+}\tilde{w}+Y^{-}\tilde{w}]}{[\tilde{w}]}\right)$ where $\frac{\left[A\right]}{\left[B\right]}$ is the component-wise division between $A$ and $B$ , and $\Delta w=w-\tilde{w}$ . Then

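$$\bar{l}(w|\tilde{w})=l(\tilde{w})+\Delta w^{T}\nabla l(\tilde{w})+\frac{1}{2}\Delta w^{T}\Phi(\tilde{w})\Delta w\qquad(8)$$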

is a separable auxiliary function for $l\left(w\right)$ = $w^{T}Yw$ at $\tilde{w}$ .

Proof.

See Appendix -B. ∎

Remark 1 (Choice of the auxiliary function).

A simpler choice for the auxiliary function would be to replace $\Phi(\tilde{w})$ with $2\lambda_{\max}(Y)I$ where $\lambda_{\max}(Y)$ is the largest eigenvalue of $Y$ (the constant $2$ appears because $l\left(w\right)=w^{T}Yw$ while there is a factor $1/2$ in front of $\Phi(\tilde{w})$ ). However, it would lead to a worse approximation. In particular if $Y$ is a diagonal matrix (since $Y\succ 0$ , these diagonal elements are positive), our choice gives $\Phi(\tilde{w})=2Y$ for any $\tilde{w}>0$ , meaning that the auxiliary function matches perfectly the function $l\left(w\right)$ . This would not be the case for the choice $2\lambda_{\max}(Y)I$ (unless $Y$ is a scaling of the identity matrix).
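Lemma 3 and Remark 1 can be checked on a small example. The Python sketch below assumes the auxiliary function has the form $\bar{l}(w|\tilde{w})=l(\tilde{w})+\Delta w^{T}\nabla l(\tilde{w})+\frac{1}{2}\Delta w^{T}\Phi(\tilde{w})\Delta w$ with $\nabla l(\tilde{w})=2Y\tilde{w}$, which is the form used in the proof in Appendix -B, and uses $Y^{+}+Y^{-}=|Y|$ entrywise. The matrix $Y$ below is an arbitrary positive definite example with entries of both signs, not one produced by the algorithm.

```python
import random

def quad_form(Y, w):
    """l(w) = w^T Y w."""
    K = len(w)
    return sum(w[i] * Y[i][j] * w[j] for i in range(K) for j in range(K))

def phi_diagonal(Y, w_tilde):
    """Diagonal of Phi(w_tilde) = Diag(2 (|Y| w_tilde) / w_tilde)."""
    K = len(w_tilde)
    return [2.0 * sum(abs(Y[i][j]) * w_tilde[j] for j in range(K)) / w_tilde[i]
            for i in range(K)]

def l_bar(Y, w, w_tilde):
    """Separable quadratic majorizer of w^T Y w at w_tilde (assumed form)."""
    K = len(w)
    grad = [2.0 * sum(Y[i][j] * w_tilde[j] for j in range(K)) for i in range(K)]
    phi = phi_diagonal(Y, w_tilde)
    dw = [w[i] - w_tilde[i] for i in range(K)]
    return (quad_form(Y, w_tilde)
            + sum(dw[i] * grad[i] for i in range(K))
            + 0.5 * sum(phi[i] * dw[i] ** 2 for i in range(K)))

# Symmetric, strictly diagonally dominant with positive diagonal, hence positive definite.
Y = [[2.0, -0.5, 0.3], [-0.5, 1.5, -0.4], [0.3, -0.4, 1.0]]
Y_diag = [[2.0, 0.0, 0.0], [0.0, 1.5, 0.0], [0.0, 0.0, 1.0]]

random.seed(2)
w_tilde = [random.uniform(0.1, 1.0) for _ in range(3)]
gaps = []
for _ in range(100):
    w = [random.uniform(0.0, 2.0) for _ in range(3)]
    gaps.append(l_bar(Y, w, w_tilde) - quad_form(Y, w))

w_probe = [0.7, 0.2, 1.3]
diag_gap = l_bar(Y_diag, w_probe, w_tilde) - quad_form(Y_diag, w_probe)
print(min(gaps), diag_gap)
```

The gaps are nonnegative for a general $Y$, and the gap vanishes when $Y$ is diagonal, in line with Remark 1.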

III-C Auxiliary function for min-vol $\beta$ -NMF

Based on the auxiliary functions presented in Sections III-A and III-B, we can directly derive a separable auxiliary function $\bar{F}(W|\tilde{W})$ for min-vol $\beta$ -NMF (2).

Corollary 1.

For $W,H\geq 0$ , $\lambda>0$ , $Y=(\tilde{W}^{T}\tilde{W}+\delta I)^{-1}$ with $\delta>0$ and the constant $c=\operatorname{logdet}\left(Y^{-1}\right)+\delta\operatorname{trace}(Y)-K$ , the function

$$\bar{F}(W|\tilde{W})=\sum_{f}G(w_{f}|\tilde{w}_{f})+\lambda\left(\sum_{f}\bar{l}(w_{f}|\tilde{w}_{f})+c\right)$$

where $G$ is given by (6) and $\bar{l}$ by (8), is a convex and separable auxiliary function for $F(W)=D_{\beta}(V|WH)+\lambda\operatorname{logdet}(W^{T}W+\delta I)$ at $\tilde{W}$ .

Proof.

This follows directly from Lemma 2, Equation (7) and Lemma 3. ∎

In the following section, we explicitly provide the MU for the KL divergence ( $\beta=1$ ) by finding a closed-form solution for the minimization of $\bar{F}$ . In Appendix -C, we provide the MU for the IS divergence ( $\beta=0$ ). Due to the lack of space, the other cases are not treated explicitly but can be handled in a similar way. For the same reason, we will only compare KL NMF models in the numerical experiments (Section IV).

III-D Algorithm for min-vol KL-NMF

As before, let us focus on a single row of $W$ , denoted $w$ , as the objective function $F(W)$ is separable by row. For $\beta=1$ , the derivative of the auxiliary function $\bar{F}(w|\tilde{w})$ with respect to a specific coefficient $w_{k}$ is given by $\nabla_{w_{k}}\bar{F}(w|\tilde{w})=\sum_{n}h_{kn}-\sum_{n}h_{kn}\frac{\tilde{w}_{k}v_{n}}{w_{k}\tilde{v}_{n}}+2\lambda\left[Y\tilde{w}\right]_{k}+2\lambda\left[\operatorname{Diag}\left(\frac{Y^{+}\tilde{w}+Y^{-}\tilde{w}}{\tilde{w}}\right)\right]_{k}w_{k}-2\lambda\left[\operatorname{Diag}\left(\frac{Y^{+}\tilde{w}+Y^{-}\tilde{w}}{\tilde{w}}\right)\right]_{k}\tilde{w}_{k}$ . Due to the separability, we set the derivative to zero to obtain the closed-form solution, which is given in Table II in matrix form.

Note that although the closed-form solution has a negative term in the numerator of the multiplicative factor (see Table II), the updates always remain nonnegative given that $V$ , $H$ and $\tilde{W}$ are nonnegative. Indeed, the term before the minus sign is never smaller than the term after it: $J_{F,N}H^{T}-4\lambda(\tilde{W}Y^{-})$ is squared (component-wise) and added to a nonnegative term, hence the component-wise square root of the result is at least as large as $J_{F,N}H^{T}-4\lambda(\tilde{W}Y^{-})$ .
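For a single row, the closed-form update can be sketched in a few lines of Python. The formula below is obtained by multiplying the derivative of $\bar{F}$ given above by $w_{k}$ and taking the positive root of the resulting quadratic; it is written coefficient by coefficient rather than in the matrix form of Table II, so it should be read as a derivation sketch rather than as the authors' implementation. The matrix $Y$ is taken as given (in the algorithm it is $(\tilde{W}^{T}\tilde{W}+\delta I)^{-1}$), and all numerical values and function names are hypothetical.

```python
import math

def kl_minvol_row_update(v, w_tilde, H, Y, lam):
    """Minimize the separable auxiliary function over one row w (beta = 1)."""
    K, N = len(w_tilde), len(v)
    v_tilde = [sum(w_tilde[k] * H[k][n] for k in range(K)) for n in range(N)]
    w_new = []
    for k in range(K):
        s = sum(H[k])                                          # sum_n h_kn
        r = sum(H[k][n] * v[n] / v_tilde[n] for n in range(N))
        p = sum(max(-Y[k][j], 0.0) * w_tilde[j] for j in range(K))  # [Y^- w~]_k
        q = sum(abs(Y[k][j]) * w_tilde[j] for j in range(K))        # [(Y^+ + Y^-) w~]_k
        b = s - 4.0 * lam * p
        root = math.sqrt(b * b + 8.0 * lam * q * r)  # root >= |b|, so the update is >= 0
        w_new.append(w_tilde[k] * (root - b) / (4.0 * lam * q))
    return w_new

def aux_gradient(v, w, w_tilde, H, Y, lam):
    """Derivative of the auxiliary function with respect to each w_k (beta = 1)."""
    K, N = len(w_tilde), len(v)
    v_tilde = [sum(w_tilde[k] * H[k][n] for k in range(K)) for n in range(N)]
    grad = []
    for k in range(K):
        s = sum(H[k])
        r = sum(H[k][n] * v[n] / v_tilde[n] for n in range(N))
        yw = sum(Y[k][j] * w_tilde[j] for j in range(K))
        d = sum(abs(Y[k][j]) * w_tilde[j] for j in range(K)) / w_tilde[k]
        grad.append(s - r * w_tilde[k] / w[k]
                    + 2.0 * lam * yw + 2.0 * lam * d * (w[k] - w_tilde[k]))
    return grad

# Hypothetical data: K = 2 components, N = 4 frames.
Y = [[0.8, -0.2], [-0.2, 0.6]]
H = [[0.5, 1.0, 0.2, 0.7], [0.9, 0.1, 0.6, 0.3]]
v = [1.0, 0.4, 0.8, 0.5]
w_tilde = [0.6, 0.4]
lam = 0.1

w_new = kl_minvol_row_update(v, w_tilde, H, Y, lam)
grad = aux_gradient(v, w_new, w_tilde, H, Y, lam)
print(w_new, grad)
```

The gradient of the auxiliary function vanishes at the updated row, and the update stays nonnegative because the square root dominates the subtracted term.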

Algorithm 1 summarizes our algorithm to tackle (2) for $\beta=1$ , which we refer to as min-vol KL-NMF. Note that the update of $H$ (step 4) is the one from [3]. More importantly, note that we have incorporated a line search for the update of $W$ . In fact, although the MU for $W$ are guaranteed to decrease the objective function, they do not guarantee that $W$ remains normalized, that is, that $||W(:,k)||_{1}=1$ for all $k$ . Hence, we normalize $W$ after it is updated (step 10), and we normalize $H$ accordingly so that $WH$ remains unchanged. When this normalization is performed, the $\beta$ -divergence part of $F$ is unchanged but the minimum-volume penalty changes, so that the objective function $F$ might increase. In order to guarantee that $F$ does not increase, we integrate a simple backtracking line-search procedure; see steps 11-16 of Algorithm 1. In summary, our MU provide a descent direction that preserves nonnegativity of the iterates, and we use a projection and a simple backtracking line search to guarantee the monotonicity of the objective function, as in standard projected gradient descent methods.
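The normalization step can be illustrated as follows. This Python sketch only shows the rescaling that makes the columns of $W$ sum to one while leaving $WH$ unchanged; it does not reproduce the backtracking line search of Algorithm 1, whose exact steps are not restated here. The matrices and helper names are made up for the example.

```python
def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def normalize_columns(W, H):
    """Divide column k of W by its l1 norm and multiply row k of H by the same factor."""
    K = len(H)
    scales = [sum(row[k] for row in W) for k in range(K)]  # W >= 0, so this is the l1 norm
    W_n = [[row[k] / scales[k] for k in range(K)] for row in W]
    H_n = [[scales[k] * h for h in H[k]] for k in range(K)]
    return W_n, H_n

W = [[0.2, 1.0], [0.6, 3.0], [1.2, 0.5]]
H = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
W_n, H_n = normalize_columns(W, H)

before = matmul(W, H)
after = matmul(W_n, H_n)
print(max(abs(a - b) for ra, rb in zip(before, after) for a, b in zip(ra, rb)))
```

Since the product is preserved, only the logdet penalty changes under this rescaling, which is why a line search on the penalized objective is needed afterwards.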

It can be verified that the computational complexity of min-vol KL-NMF is asymptotically equivalent to that of the standard MU for $\beta$ -NMF, that is, it requires $\mathcal{O}\left(FNK\right)$ operations per iteration. Indeed, the main operations are matrix products with a complexity of $\mathcal{O}\left(FNK\right)$ and element-wise operations on matrices of size $F\times K$ or $K\times N$ . Note that the inversion of the $K$ -by- $K$ matrix $(W^{T}W+\delta I)$ requires $\mathcal{O}\left(K^{3}\right)$ operations, which is dominated by $\mathcal{O}\left(FNK\right)$ since $K\leq\min(F,N)$ (in fact, typically $K\ll\min(F,N)$ hence this term is negligible). Therefore, although Algorithm 1 will be slower than the baseline KL-NMF (that is, the standard MU) because of the additional terms to be computed and the line search, the asymptotic computational cost is the same; see Table IV for a runtime comparison.

IV Numerical experiments

In this section we report an experimental comparative study of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18] applied to the spectrograms of two monophonic piano sequences and a synthetic mix of a bass and drums. For the two monophonic piano sequences, the audio signals are real-life recordings of standard quality. Since the sequences are made of pure piano notes, the number $K$ should correspond to the number of notes present in the mixed signals. The comparative study is performed for several values of $K$ , with a focus on the case where the factorization rank $K$ is overestimated. For all simulations, random initializations are used for $W$ and $H$ , and the best results among 5 runs are kept for the comparative study. In all cases, we use a Hamming window of size $F$ =1024, and 50% overlap between two frames. Sparse KL-NMF has a structure similar to that of min-vol KL-NMF, with a penalty parameter for the sparsity-enforcing regularization. To tune the penalty parameters of these two methods, we have used the same strategy: we manually tried a wide range of values and report the best results. The code is available from bit.ly/minvolKLNMF (code written in MATLAB R2017a), and can be used to directly rerun all experiments below. They were run on a laptop computer with Intel Core i7-7500U CPU $@$ 2.70GHz $\times$ 4 and 32GB memory.

Mary had a little lamb

The first audio sample is the first measure of “Mary had a little lamb”. The sequence is composed of three notes: $E_{4}$ , $D_{4}$ and $C_{4}$ , played all at once. The recorded signal is 4.7 seconds long and downsampled to $f_{s}=16000$ Hz, yielding $T$ =75200 samples. The STFT of the input signal $x$ yields a temporal resolution of 16ms and a frequency resolution of 31.25Hz, so that the amplitude spectrogram $V$ has $N$ =294 frames and $F$ =257 frequency bins. The musical score is shown in Figure 3.
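The resolutions quoted above follow from simple arithmetic on the STFT parameters, sketched here in Python. The 512-sample window used below is an inference from the stated 31.25Hz resolution and 257 bins at $f_{s}=16000$ Hz, not a value given in the text; the function name and the returned dictionary are hypothetical. The number of frames $N$ depends on the padding convention and is therefore not computed.

```python
def stft_resolution(fs, window, overlap=0.5):
    """Hop size, time/frequency resolution and number of bins of a one-sided STFT."""
    hop = int(round(window * (1.0 - overlap)))
    return {
        "hop": hop,
        "time_res_ms": 1000.0 * hop / fs,   # spacing between consecutive frames
        "freq_res_hz": fs / window,         # spacing between consecutive bins
        "bins": window // 2 + 1,            # nonnegative frequencies only
    }

# Assumed window length of 512 samples, inferred from the figures quoted above.
res = stft_resolution(fs=16000, window=512, overlap=0.5)
print(res)
```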

All NMF algorithms were run for 200 iterations which allowed them to converge. Figure 4 presents the columns of $W$ (dictionary matrix) and the rows of $H$ for baseline KL-NMF and min-vol KL-NMF with $K=3$ .

Figure 5 presents the time-frequency masking coefficients. These coefficients are computed as follows

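$$\text{mask}^{(k)}_{f,n}=\frac{\hat{X}^{(k)}_{f,n}}{\sum_{j=1}^{K}\hat{X}^{(j)}_{f,n}}\quad\text{for }k=1,\dots,K,$$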

where $\hat{X}^{(k)}=W(:,k)H(k,:)$ is the estimated source $k$ . The masks are nonnegative and sum to one for each pair $\left(f,n\right)$ . This representation makes it possible to check visually whether the NMF algorithm was able to separate the sources properly.
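A minimal Python sketch of these masking coefficients is given below. It assumes that each mask is the ratio of the rank-one estimate $\hat{X}^{(k)}$ to the sum of all $K$ estimates, which is the usual construction and is consistent with the masks summing to one; it also assumes that $WH$ has no zero entry. The small factors $W$ and $H$ are made up for the example.

```python
def soft_masks(W, H):
    """masks[k][f][n] = W[f][k] * H[k][n] / sum_j W[f][j] * H[j][n]."""
    F, K, N = len(W), len(H), len(H[0])
    masks = [[[0.0] * N for _ in range(F)] for _ in range(K)]
    for f in range(F):
        for n in range(N):
            total = sum(W[f][j] * H[j][n] for j in range(K))  # assumed positive
            for k in range(K):
                masks[k][f][n] = W[f][k] * H[k][n] / total
    return masks

W = [[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]]
H = [[1.0, 0.2, 0.0], [0.1, 0.9, 2.0]]
masks = soft_masks(W, H)

# For every time-frequency pair (f, n) the masks over k sum to one.
sums = [sum(masks[k][f][n] for k in range(2)) for f in range(3) for n in range(3)]
print(sums)
```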

All the simulations give a nice separation with similar results for $W$ and $H$ . The activations are coherent with the sequences of the notes. However, Figure 5 shows that min-vol KL-NMF and sparse KL-NMF provide a better separation in terms of time-frequency localization compared to the baseline KL-NMF.

We now perform the same experiment but using $K$ =7. Figure 6 presents the results. This corresponds to the case where the factorization rank is overestimated. Figure 7 presents the time-frequency masking coefficients.

We observe that min-vol KL-NMF is able to extract the three notes correctly and automatically sets three source estimates to zero (more precisely, three rows of $H$ are set to zero, while the corresponding columns of $W$ have entries equal to one another as $||W(:,k)||_{1}=1$ for all $k$ ), while baseline KL-NMF and sparse KL-NMF split the notes across all the sources. One can observe that a fourth note is identified in all simulations (see isolated peaks on Figure 7-(b), second row of $H$ from the top) and corresponds to the noise within the piano just before triggering a specific note (in particular, the hammer noise). This observation is confirmed by the fact that the amplitude is proportional to the natural strength of the fingers playing the notes. In this scenario, where $K$ is overestimated, min-vol KL-NMF outperforms baseline KL-NMF and sparse KL-NMF.

Prelude of Bach

The second audio sample corresponds to the first 30 seconds of “Prelude and Fugue No.1 in C major” by J. S. Bach played by Glenn Gould (https://www.youtube.com/watch?v=ZlbK5r5mBH4). The audio sample is a sequence of 13 notes: $B_{3}$ , $C_{4}$ , $D_{4}$ , $E_{4}$ , $F^{\#}_{4}$ , $G_{4}$ , $A_{4}$ , $C_{5}$ , $D_{5}$ , $E_{5}$ , $F_{5}$ , $G_{5}$ , $A_{5}$ . The recorded signal is downsampled to $f_{s}=11025$ Hz, yielding $T$ =330750 samples. The STFT of the input signal $x$ yields a temporal resolution of 46ms and a frequency resolution of 10.76Hz, so that the amplitude spectrogram $V$ has $N$ =647 frames and $F$ =513 frequency bins. The musical score is presented in Figure 8. All NMF algorithms were run for 300 iterations, which allowed them to converge. Figure 9 presents the results obtained for $W$ and $H$ with a factorization rank $K=16$ , hence overestimated. We observe that min-vol KL-NMF automatically sets three components to zero (marked with the * symbol on Figure 9) while 13 source estimates are determined. The fundamentals (maximum peak frequencies) of the 13 source estimates correspond to the theoretical fundamentals of the 13 notes mentioned earlier. Note that baseline KL-NMF and sparse KL-NMF led to the same conclusions as for the first audio sample: these two algorithms generate as many source estimates as imposed by the factorization rank, while min-vol KL-NMF preserves the integrity of the 13 sources. Additionally, the activations are coherent with the sequences of the notes. Figure 10 shows (on a limited time interval) that the estimated sequence follows the sequence defined in the score. Note that a threshold and permutations on the rows of $H$ were used to improve visibility.

Bass and drums

The third audio signal is a synthetic mix of a bass and drums (http://isse.sourceforge.net/demos.html). The audio signal is downsampled to $f_{s}$ = $16000$ Hz, yielding $T$ =104821 samples. The STFT of the input signal $x$ yields a temporal resolution of 32ms and a frequency resolution of 15.62Hz, so that the amplitude spectrogram $V$ has $N$ =206 frames and $F$ =513 frequency bins. For this synthetic mix, we have access to the true sources in the form of two audio files. Therefore, we can estimate the quality of the separation with standard metrics, namely the signal to distortion ratio (SDR), the source to interference ratio (SIR) and the source to artifacts ratio (SAR) [19]. They have been computed with the toolbox BSS Eval (http://bass-db.gforge.inria.fr/bss_eval/). The metrics are expressed in dB; the higher they are, the better the separation. Algorithms min-vol KL-NMF, baseline KL-NMF and sparse KL-NMF have been considered for this comparative study. A factorization rank equal to two is used. It is clear that a rank-one approximation of each source is too simplistic, but the goal is to compare the algorithms and show that min-vol KL-NMF is able to find a better solution even in this simplified context. All NMF algorithms were run for 400 iterations, which allowed them to converge. Table III shows the results.

Except for the SAR metric for the second source (drums), min-vol KL-NMF outperforms baseline KL-NMF and sparse KL-NMF.

Runtime performance

Let us compare the runtime of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18]. The algorithms are compared on three setups built from the first two audio samples presented above:

• Setup $\sharp$ 1: sample “Mary had a little lamb” with $K=3$ , 200 iterations.

• Setup $\sharp$ 2: sample “Mary had a little lamb” with $K=7$ , 200 iterations.

• Setup $\sharp$ 3: “Prelude and Fugue No.1 in C major” with $K=16$ , 300 iterations.

For each test setup, the algorithms are run for the same 20 random initializations of $W$ and $H$ . Table IV reports the average and standard deviation of the runtime (in seconds) over these 20 runs. We observe that min-vol KL-NMF (Algorithm 1) is slower, but not significantly so, as expected. In particular, on the larger setup $\sharp$ 3, it is less than three times slower than the standard MU.

V Conclusion and Perspectives

In this paper, we have presented a new NMF model for audio source separation based on the minimization of a cost function that includes a $\beta$ -divergence (data fitting term) and a penalty term that promotes solutions $W$ with minimum volume. We have proved the identifiability of the model in the exact case, under the sufficiently scattered condition for the activation matrix $H$ . We have provided multiplicative updates to tackle this problem and have illustrated the behaviour of the method on real-world audio signals. We highlighted the capacity of the model to deal with the case where $K$ is overestimated by automatically setting some components to zero while still giving good results for the source estimates.

Further work includes tackling the following questions:

• Under which conditions can we prove the identifiability of min-vol $\beta$ -NMF in the presence of noise, and in the rank-deficient case?

• Can we prove that min-vol $\beta$ -NMF performs model order selection automatically? Under which conditions? We have observed this behaviour on many examples, but the proof remains elusive.

• Can we design more efficient algorithms?

Further work also includes the use of our new model min-vol $\beta$ -NMF for other applications and the design of more efficient algorithms (for example, that avoid using a line-search procedure) with stronger convergence guarantees (beyond the monotonicity of the objective function).

Acknowledgments

We thank Kejun Huang and Xiao Fu for helpful discussion on Theorem 1, and giving us the insight to adapt their proof from [7] to our model (2). We also thank the reviewers for their insightful comments that helped us improve the paper.

-A Sufficiently scattered condition and identifiability

Before giving the definition of the sufficiently scattered condition from [8], let us first recall an important property of the duals of nested cones.

Lemma 4.

Let $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ be convex cones such that $\mathcal{C}_{1}\subseteq\mathcal{C}_{2}$ . Then $\mathcal{C}^{*}_{2}\subseteq\mathcal{C}^{*}_{1}$ where $\mathcal{C}^{*}_{1}$ and $\mathcal{C}^{*}_{2}$ are respectively the dual cones of $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ . The dual of a cone $\mathcal{C}$ is defined as $\mathcal{C}^{*}=\left\{y|x^{T}y\geq 0\text{ for all }x\in\mathcal{C}\right\}$ .

Definition 2.

(Sufficiently Scattered) A matrix $H\in\mathbb{R}_{+}^{K\times N}$ is sufficiently scattered if

1. $\mathcal{C}\subseteq\operatorname{cone}\left(H\right)$ , and

2. $\operatorname{cone}\left(H\right)^{*}\cap\text{bd}\mathcal{C}^{*}=\left\{\lambda e_{k}|\lambda\geq 0,k=1,...,K\right\}$ ,

where $\mathcal{C}=\left\{x|x^{T}e\geq\sqrt{K-1}\left\|x\right\|_{2}\right\}$ is a second order cone, $\mathcal{C}^{*}=\left\{x|x^{T}e\geq\left\|x\right\|_{2}\right\}$ , $\operatorname{cone}\left(H\right)=\left\{x|x=H\theta,\theta\geq 0\right\}$ is the conic hull of the columns of $H$ , and bd denotes the boundary of a set.

We can now prove Theorem 1.

Proof of Theorem 1.

Recall that $W^{\#}$ and $H^{\#}$ are the true latent factors that generated $V$ , with $\text{rank}(V)=K$ and $H^{\#}$ is sufficiently scattered. Let us consider $\hat{W}$ and $\hat{H}$ a feasible solution of (3). Since $\text{rank}(V)=K$ and $V=\hat{W}\hat{H}$ , we must have $\operatorname{rank}(\hat{W})=\text{rank}(\hat{H})=K$ . Hence there exists an invertible matrix $A\in\mathbb{R}^{K\times K}$ such that $\hat{W}=W^{\#}A^{-1}$ and $\hat{H}=AH^{\#}$ . Since $\hat{W}$ is a feasible solution of problem (3), we have

$$e^{T}=e^{T}\hat{W}=e^{T}W^{\#}A^{-1}=e^{T}A^{-1},$$

where we assumed $e^{T}W^{\#}=e^{T}$ without loss of generality since $W^{\#}\geq 0$ and $\text{rank}(W^{\#})=K$ . Note that $e^{T}A^{-1}=e^{T}$ is equivalent to $e^{T}A=e^{T}$ . This means that matrix $A$ is column stochastic. Therefore we have that $e^{T}Ae=K$ . Since $\hat{H}$ is a feasible solution, we also have $\hat{H}=AH^{\#}\geq 0$ . Let us denote by $a_{j}$ the $j$ th row of $A$ , and by $a^{T}_{k}$ the $k$ th column of $A^{T}$ . By the definition of a dual cone, $AH^{\#}\geq 0$ means that the rows satisfy $a_{j}\in\operatorname{cone}(H^{\#})^{*}$ for $j=1,...,K$ . Since $H^{\#}$ is sufficiently scattered, $\operatorname{cone}(H^{\#})^{*}\subseteq\mathcal{C}^{*}$ (by Lemma 4), hence $a_{j}\in\mathcal{C}^{*}$ . Therefore we have $\left\|a_{j}\right\|_{2}\leq a_{j}e$ by definition of $\mathcal{C}^{*}$ . This leads to the following: $|\text{det}(A)|=|\text{det}(A^{T})|\leq\prod_{k=1}^{K}\left\|a^{T}_{k}\right\|_{2}=\prod_{j=1}^{K}\left\|a_{j}\right\|_{2}\leq\prod_{j=1}^{K}a_{j}e\leq\left(\frac{\sum_{j=1}^{K}a_{j}e}{K}\right)^{K}=\left(\frac{e^{T}Ae}{K}\right)^{K}=1$ . The first inequality is the Hadamard inequality, the second inequality is due to $a_{j}\in\mathcal{C}^{*}$ , and the third inequality is the arithmetic-geometric mean inequality. Now we can conclude exactly as is done in [7, Theorem 1] by showing that matrix $A$ can only be a permutation matrix for an optimal solution ( $\hat{W}$ , $\hat{H}$ ) of (3), and therefore identifiability for model (3) holds. ∎

-B Proof of Lemma 3

Separability of $\bar{l}(w|\tilde{w})$ holds since $\Phi\left(\tilde{w}\right)$ is diagonal. The condition $\bar{l}(\tilde{w}|\tilde{w})=l(\tilde{w})$ from Definition 1 can be checked easily. It remains to prove that $\bar{l}(w|\tilde{w})\geq l(w)$ for all $w$ . Let us first rewrite the quadratic function $l(w)$ using its Taylor expansion at $w=\tilde{w}$ : $l(w)=l(\tilde{w})+\left(w-\tilde{w}\right)^{T}\nabla l\left(\tilde{w}\right)+\frac{1}{2}\left(w-\tilde{w}\right)^{T}\nabla^{2}l\left(\tilde{w}\right)\left(w-\tilde{w}\right)=l(\tilde{w})+\left(w-\tilde{w}\right)^{T}2Y\tilde{w}+\frac{1}{2}\left(w-\tilde{w}\right)^{T}2Y\left(w-\tilde{w}\right)$ . Proving that $\bar{l}(w|\tilde{w})\geq l(w)$ is equivalent to proving that $\frac{1}{2}\left(w-\tilde{w}\right)^{T}\left[\Phi\left(\tilde{w}\right)-2Y\right]\left(w-\tilde{w}\right)\geq 0$ , which boils down to proving that the matrix $\left[\Phi\left(\tilde{w}\right)-2Y\right]$ is positive semi-definite. We have $\Phi_{ij}(\tilde{w})=2\delta_{ij}\frac{(Y^{+}\tilde{w})_{i}+(Y^{-}\tilde{w})_{i}}{\tilde{w}_{i}}$ , where $\delta_{ij}$ is the Kronecker symbol. Let us consider the following matrix: $M_{ij}(\tilde{w})=\tilde{w}_{i}\left[\Phi\left(\tilde{w}\right)-2Y\right]_{ij}\tilde{w}_{j}$ , which is a rescaling of $\left[\Phi\left(\tilde{w}\right)-2Y\right]$ . It remains to show that $M$ is positive semi-definite. (The remainder of the proof was suggested to us by one of the reviewers; it is more elegant and simpler than our original proof.) Since $M$ is symmetric and its diagonal entries are non-negative, it is sufficient to show that $M$ is diagonally dominant [horn1985matrix, Proposition 7.2.3], that is,

$$|M_{ii}|\geq\sum_{j\neq i}|M_{ij}|\quad\text{for all }i.$$

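We have for all $i$ that

$$M_{ii}=2\tilde{w}_{i}\sum_{j}\left(Y^{+}_{ij}+Y^{-}_{ij}\right)\tilde{w}_{j}-2Y_{ii}\tilde{w}_{i}^{2}\quad\text{and}\quad\sum_{j\neq i}|M_{ij}|=2\tilde{w}_{i}\sum_{j\neq i}|Y_{ij}|\tilde{w}_{j}.$$

Since $Y_{ij}^{+}+Y_{ij}^{-}=\left|Y_{ij}\right|$ , we have

$$|M_{ii}|-\sum_{j\neq i}|M_{ij}|=2\tilde{w}_{i}^{2}\left(|Y_{ii}|-Y_{ii}\right)=0,$$

where the last equality uses $Y_{ii}>0$ (as $Y\succ 0$ ), implying that $M$ is diagonally dominant.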

-C Algorithm for min-vol IS-NMF

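For $\beta=0$ (IS divergence), the derivative of the auxiliary function $\bar{F}(w|\tilde{w})$ with respect to a specific coefficient $w_{k}$ is given by:

$$\nabla_{w_{k}}\bar{F}(w|\tilde{w})=\sum_{n}\frac{h_{kn}}{\tilde{v}_{n}}-\frac{\tilde{w}_{k}^{2}}{w_{k}^{2}}\sum_{n}h_{kn}\frac{v_{n}}{\tilde{v}_{n}^{2}}+2\lambda\left[Y\tilde{w}\right]_{k}+2\lambda\left[\operatorname{Diag}\left(\frac{Y^{+}\tilde{w}+Y^{-}\tilde{w}}{\tilde{w}}\right)\right]_{k}\left(w_{k}-\tilde{w}_{k}\right).$$

Let

$$\tilde{a}=2\lambda\left[\operatorname{Diag}\left(\frac{Y^{+}\tilde{w}+Y^{-}\tilde{w}}{\tilde{w}}\right)\right]_{k},\quad\tilde{b}=\sum_{n}\frac{h_{kn}}{\tilde{v}_{n}}-4\lambda\left[Y^{-}\tilde{w}\right]_{k},\quad\tilde{d}=-\tilde{w}_{k}^{2}\sum_{n}h_{kn}\frac{v_{n}}{\tilde{v}_{n}^{2}}.$$

Setting the derivative to zero requires computing the roots of the following degree-three polynomial: $\tilde{a}w_{k}^{3}+{\tilde{b}}w_{k}^{2}+{\tilde{d}}$ . We used the procedure developed in [20], which is based on the explicit calculation of the intermediary root of a canonical form of the cubic. This procedure is able to provide highly accurate numerical results even for badly conditioned polynomials. The algorithm for min-vol IS-NMF follows the same steps as for min-vol KL-NMF: only the two steps corresponding to the updates of $W$ and $H$ have to be modified. For the update of $H$ (step 4), use the standard MU. For the update of $W$ (step 9), use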

for $f\leftarrow 1$ to $F$

for $k\leftarrow 1$ to $K$

Compute $\tilde{a}$ , $\tilde{b}$ and $\tilde{d}$ as defined above

Compute the roots of $\tilde{a}w_{k}^{3}+{\tilde{b}}w_{k}^{2}+{\tilde{d}}$

Pick $y$ among these roots and zero that minimizes the objective

$W^{+}_{f,k}\leftarrow\text{max}\left(10^{-16},y\right)$

end for

end for
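The root-finding step of this loop can be sketched in Python as follows. This is not the procedure of [20]: it swaps in plain bisection, which is slower but easy to verify. It assumes $\tilde{a}>0$ and $\tilde{d}<0$, in which case the polynomial $\tilde{a}w^{3}+\tilde{b}w^{2}+\tilde{d}$ is negative at $w=0$, first decreases (if $\tilde{b}<0$) and then increases to infinity, so it has exactly one positive root. Whether these sign conditions always hold depends on the exact definitions of the coefficients, so they are checked explicitly; the function name and the test values are hypothetical.

```python
def positive_root_cubic(a, b, d, tol=1e-14, max_iter=200):
    """Unique positive root of a*w**3 + b*w**2 + d, assuming a > 0 and d < 0."""
    if not (a > 0.0 and d < 0.0):
        raise ValueError("expected a > 0 and d < 0")

    def p(w):
        return (a * w + b) * w * w + d

    lo, hi = 0.0, 1.0
    while p(hi) <= 0.0:          # grow the bracket until the polynomial is positive
        hi *= 2.0
    for _ in range(max_iter):    # invariant: p(lo) <= 0 < p(hi)
        mid = 0.5 * (lo + hi)
        if p(mid) > 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    return 0.5 * (lo + hi)

# w^3 + w^2 - 2 has the positive root w = 1.
root = positive_root_cubic(1.0, 1.0, -2.0)
# Entry update with the same safeguard as in the loop above.
w_new = max(1e-16, root)
print(root, w_new)
```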

Bibliography

[1] A. Lefèvre, “Méthode d’apprentissage de dictionnaire pour la séparation de sources audio avec un seul capteur,” Ph.D. dissertation, Ecole Normale Supérieure de Cachan, 2012.
[2] P. Magron, “Reconstruction de phase par modèles de signaux : application à la séparation de sources audio,” Ph.D. dissertation, TELECOM Paris Tech, 2016.
[3] D. Lee and H. Seung, “Algorithms for non-negative matrix factorization,” in NIPS’00 Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS. MIT Press Cambridge, 2000, pp. 535–541.
[4] C. Févotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis,” Neural computation, vol. 21, no. 3, pp. 793–830, 2009.
[5] C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the $\beta$ -divergence,” Neural computation, vol. 23, no. 9, pp. 2421–2456, 2011.
[6] G. Zhou, S. Xie, Z. Yang, J.-M. Yang, and Z. He, “Minimum-volume-constrained nonnegative matrix factorization: Enhanced ability of learning parts,” IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1626–1637, 2011.
[7] X. Fu, K. Huang, and N. D. Sidiropoulos, “On identifiability of nonnegative matrix factorization,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 328–332, 2018.
[8] K. Huang, N. Sidiropoulos, and A. Swami, “Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition,” IEEE Transactions on Signal Processing, vol. 62, no. 1, pp. 211–224, 2014.
