Blind Audio Source Separation with Minimum-Volume Beta-Divergence NMF
Valentin Leplat, Nicolas Gillis, Man Shun Ang

TL;DR
This paper introduces a novel NMF-based model with a volume penalty for blind audio source separation, demonstrating improved interpretability and automatic source number estimation under noiseless conditions.
Contribution
The paper proposes a new NMF model with a volume penalty term, providing theoretical guarantees for source identification and automatic model order selection.
Findings
More interpretable separation results compared to standard NMF.
Effective source recovery even when the number of sources is overestimated.
Automatic zeroing of sources in overestimated scenarios.
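[Table II: multiplicative update of $W$ in matrix form, where $A \odot B$ (resp. $\frac{[A]}{[B]}$) is the Hadamard product (resp. division) between $A$ and $B$, $A^{(.\alpha)}$ is the element-wise exponent $\alpha$ of $A$, $J_{F,N}$ is the $F$-by-$N$ all-one matrix, and $Y = Y^{+} - Y^{-}$ with $Y = (W^{\top}W + \delta I)^{-1}$, $Y^{+} = \max(Y, 0)$, $Y^{-} = \max(-Y, 0)$.]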
| Algorithms | Source 1 (bass): SDR (dB) | Source 1 (bass): SIR (dB) | Source 1 (bass): SAR (dB) | Source 2 (drums): SDR (dB) | Source 2 (drums): SIR (dB) | Source 2 (drums): SAR (dB) |
|---|---|---|---|---|---|---|
| min-vol KL-NMF | -1.14 | 0.12 | 7.78 | 9.60 | 19.8 | 10.09 |
| baseline KL-NMF | -4.26 | -1.39 | 2.64 | 7.97 | 9.00 | 15.25 |
| sparse KL-NMF | -4.69 | -1.73 | 2.33 | 7.89 | 8.96 | 14.98 |
| Algorithms | Runtime in seconds, setup 1 | Runtime in seconds, setup 2 | Runtime in seconds, setup 3 |
|---|---|---|---|
| baseline KL-NMF | 0.44 ± 0.03 | 0.43 ± 0.01 | 3.81 ± 0.19 |
| min-vol KL-NMF | 3.79 ± 0.13 | 2.39 ± 0.30 | 10.19 ± 1.28 |
| sparse KL-NMF | 0.20 ± 0.02 | 0.20 ± 0.01 | 2.21 ± 0.26 |
Valentin Leplat, Nicolas Gillis, Andersen M.S. Ang
- Department of Mathematics and Operational Research, Faculté Polytechnique, Université de Mons, Rue de Houdain 9, 7000 Mons, Belgium. The authors acknowledge the support by the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47, and by the European Research Council (ERC starting grant no 679515). E-mails: {valentin.leplat, nicolas.gillis, manshun.ang}@umons.ac.be. Manuscript received in July 2019. Accepted April 2020.
Abstract
Considering a mixed signal composed of various audio sources and recorded with a single microphone, we consider in this paper the blind audio source separation problem, which consists in isolating and extracting each of the sources. To perform this task, nonnegative matrix factorization (NMF) based on the Kullback-Leibler and Itakura-Saito $\beta$-divergences is a standard and state-of-the-art technique that uses the time-frequency representation of the signal. We present a new NMF model better suited for this task. It is based on the minimization of $\beta$-divergences along with a penalty term that promotes the columns of the dictionary matrix to have a small volume. Under some mild assumptions and in noiseless conditions, we prove that this model is able to identify the sources. In order to solve this problem, we propose multiplicative updates whose derivations are based on the standard majorization-minimization framework. We show on several numerical experiments that our new model is able to obtain more interpretable results than standard NMF models. Moreover, we show that it is able to recover the sources even when the number of sources present in the mixed signal is overestimated. In fact, our model automatically sets sources to zero in this situation, hence performs model order selection automatically.
Index Terms:
nonnegative matrix factorization, $\beta$-divergences, minimum-volume regularization, identifiability, blind audio source separation, model order selection
I Introduction
Blind audio source separation concerns the techniques used to extract unknown signals, called sources, from a mixed audio signal $x$. In this paper, we assume that the audio signal is recorded with a single microphone. Considering a mixed signal composed of various audio sources, blind audio source separation consists in isolating and extracting each of the sources on the basis of the single recording. Usually, the only known information is an estimate of the number of sources present in the mixed signal. The blind source separation problem is said to be underdetermined as there are fewer sensors (only one in our case) than sources. It then appears necessary to find additional information to make the problem well posed. The most common technique used for this kind of problem is to get some form of redundancy in the mixed signal in order to make it overdetermined. This is typically done by computing the spectrogram, which represents the signal in the time and frequency domains simultaneously (splitting the signal into overlapping time frames). The computation of spectrograms can be summarized as follows: short time segments are extracted from the signal and multiplied element-wise by a window function or “smoothing” window of size $F$. Successive windows overlap by a fraction of their length, which is usually taken as 50%. On each of these segments, a discrete Fourier transform is computed, and the results are stacked column-by-column in a matrix $X$. Thus, from a one-dimensional signal $x \in \mathbb{R}^{T}$, we obtain a complex matrix $X \in \mathbb{C}^{F \times N}$ called spectrogram, where $FN \approx 2T$ (due to the 50% overlap between windows). Note that the length of the window determines the shape of the spectrogram. These preliminary operations correspond to computing the short time Fourier transform (STFT), which is given by the following formula: for $1 \leq f \leq F$ and $1 \leq n \leq N$, $X_{f,n} = \sum_{j=0}^{F-1} w_{j}\, x_{nL+j}\, e^{-i \frac{2\pi f j}{F}}$, where $w \in \mathbb{R}^{F}$ is the smoothing window of size $F$, $L$ is a shift parameter (also called hop size), and $F - L$ is the overlap parameter. The number of rows $F$ corresponds to the frequency resolution. Letting $f_{s}$ be the sampling rate of the audio signal, consecutive rows correspond to frequency bands that are $f_{s}/F$ Hz apart.
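The spectrogram construction described above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: a Hamming window, 50% overlap, no padding at the signal borders, and a one-sided FFT (so a window of 512 samples gives 257 frequency bins). The function name `amplitude_spectrogram` and the 1 kHz test tone are hypothetical and are not taken from the authors' MATLAB code.

```python
import numpy as np

def amplitude_spectrogram(x, win_size=512, overlap=0.5):
    """Return the complex STFT X and the amplitude spectrogram V = |X|.

    Assumed conventions (not from the paper): Hamming window, frames that
    fit entirely inside the signal, one-sided FFT of real frames.
    """
    hop = int(win_size * (1.0 - overlap))        # shift between successive frames
    window = np.hamming(win_size)                # smoothing window
    n_frames = 1 + (len(x) - win_size) // hop    # number of complete frames
    frames = np.stack(
        [x[n * hop : n * hop + win_size] * window for n in range(n_frames)],
        axis=1,
    )                                            # one windowed segment per column
    X = np.fft.rfft(frames, axis=0)              # DFT of each column
    return X, np.abs(X)

fs = 16000                                       # sampling rate in Hz (example value)
t = np.arange(fs) / fs                           # one second of signal
x = np.sin(2 * np.pi * 1000.0 * t)               # hypothetical 1 kHz test tone
X, V = amplitude_spectrogram(x)
print(V.shape)                                   # (frequency bins, time frames)
```

With these settings, consecutive rows are `fs / win_size = 31.25` Hz apart and consecutive frames are 16 ms apart, so the 1 kHz tone falls on row 32 of `V`.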
The time-frequency representation of a signal highlights two of its fundamental properties: sparsity and redundancy. Sparsity comes from the fact that most real signals are not active at all frequencies at all time points. Redundancy comes from the fact that frequency patterns of the sources repeat over time. Mathematically, this means that the spectrogram is a low-rank matrix. These two fundamental properties led sound source separation techniques to integrate algorithms such as nonnegative matrix factorization (NMF). Such techniques retrieve sensible solutions even for single-channel signals.
I-A Mixing assumptions
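Given $K$ source signals $s^{(k)} \in \mathbb{R}^{T}$ for $1 \leq k \leq K$, we assume the acquisition process is well modelled by a linear instantaneous mixing model:
$$x(t) = \sum_{k=1}^{K} s^{(k)}(t), \quad t = 0, \dots, T-1. \tag{1}$$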
Therefore, for each time index $t$, the mixed signal $x(t)$ from a single microphone is the sum of the $K$ source signals. It is standard to assume that microphones are linear as long as the recorded signals are not too loud. If signals are too loud, they are usually clipped. The mixing process is modelled as instantaneous, as opposed to convolutive models, which are used to take into account sound effects such as reverberation. The source separation problem consists in finding source estimates $\hat{s}^{(k)}$ of the sources $s^{(k)}$ for all $k \in \{1, \dots, K\}$. Let us denote by $\mathcal{S}$ the linear STFT operator, and let $\mathcal{S}^{\dagger}$ be its conjugate transpose. We have $\mathcal{S}^{\dagger}\mathcal{S} = I$, where $I$ is the identity matrix of appropriate dimension. For the remainder of this paper, $\mathcal{S}^{\dagger}$ stands for the inverse short time Fourier transform. Note that the term inverse is not meant in a mathematical sense. Indeed, the STFT is not a surjective transformation from $\mathbb{R}^{T}$ to $\mathbb{C}^{F \times N}$. In other words, not every matrix with complex entries is the STFT of a real signal; see [1] and [2] for more details. By applying the STFT operator $\mathcal{S}$ to (1), we obtain the mixing model in the time-frequency domain:
$$X = \mathcal{S}(x) = \sum_{k=1}^{K} S^{(k)},$$
where $S^{(k)}$ is the STFT of the source $k$, that is, the spectrogram of source $k$. To identify the sources, we use in this paper the amplitude spectrogram $V = |X| \in \mathbb{R}^{F \times N}_{+}$ defined as $V_{fn} = |X_{fn}|$ for all $f$, $n$. We assume that $V = \sum_{k=1}^{K} |S^{(k)}|$, which means that there is no sound cancellation between the sources, which is usually the case in most signals. Finally, we assume that the source spectrograms $|S^{(k)}|$ are well approximated by nonnegative rank-one matrices. This leads to the NMF model described in the next section. Note that a source can be made of several rank-one factors, in which case a post-processing step will have to recombine them a posteriori (e.g., by looking at the correspondence in the activation of the sources over time). Note also that we focus on the NMF stage of the source separation, which factorizes $V$ into the source spectrograms. For the phase reconstruction, which is a highly non-trivial problem, we consider a naive reconstruction procedure consisting in keeping the same phase as the input mixture for each source [1].
I-B NMF for audio source separation
Given a nonnegative matrix $V \in \mathbb{R}^{F \times N}_{+}$ (the spectrogram) and a positive integer $K \ll \min(F, N)$ (the number of sources, called the factorization rank), NMF aims to compute two nonnegative matrices, $W$ with $K$ columns and $H$ with $K$ rows, such that $V \approx WH$. NMF approximates each column of $V$ by a linear combination of the columns of $W$ weighted by the components of the corresponding column of $H$ [3]. When the matrix $V$ corresponds to the amplitude spectrogram or the power spectrogram of an audio signal, we have that
$W$ is referred to as the dictionary matrix and each column corresponds to the spectral content of a source, and
$H$ is the activation matrix specifying whether a source is active at a certain time frame and with which intensity.
In other words, each rank-one factor $W(:,k)H(k,:)$ will correspond to a source: the $k$th column $W(:,k)$ of $W$ is the spectral content of source $k$, and the $k$th row $H(k,:)$ of $H$ is its activation over time. To compute $W$ and $H$, NMF requires solving the following optimization problem
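$$\min_{W \geq 0,\, H \geq 0} D(V|WH) = \sum_{f=1}^{F} \sum_{n=1}^{N} d\big(V_{fn} \,|\, [WH]_{fn}\big),$$
where $A \geq 0$ means that $A$ is component-wise nonnegative, and $d(x|y)$ is an appropriate measure of fit. In audio source separation, a common measure of fit is the discrete $\beta$-divergence, denoted $d_{\beta}(x|y)$ and equal to
$$d_{\beta}(x|y) = \begin{cases} \dfrac{x}{y} - \log\dfrac{x}{y} - 1 & \text{for } \beta = 0, \\ x \log\dfrac{x}{y} - x + y & \text{for } \beta = 1, \\ \dfrac{1}{\beta(\beta-1)}\left(x^{\beta} + (\beta-1)\,y^{\beta} - \beta\, x\, y^{\beta-1}\right) & \text{for } \beta \neq 0, 1. \end{cases}$$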
For $\beta = 2$, this is the standard squared Euclidean distance, that is, the squared Frobenius norm $\|V - WH\|_{F}^{2}$ (up to the constant factor $1/2$). For $\beta = 1$ and $\beta = 0$, the $\beta$-divergence corresponds to the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence, respectively. The error measure should be chosen according to the noise statistics assumed on the data. The Frobenius norm assumes i.i.d. Gaussian noise, the KL divergence assumes additive Poisson noise, and the IS divergence assumes multiplicative Gamma noise [4]. The $\beta$-divergence $d_{\beta}(x|y)$ is homogeneous of degree $\beta$: $d_{\beta}(\gamma x|\gamma y) = \gamma^{\beta} d_{\beta}(x|y)$. It implies that factorizations obtained with $\beta > 0$ (such as the Euclidean distance or the KL divergence) will rely more heavily on the largest data values, and less precision is to be expected in the estimation of the low-power components. The IS divergence ($\beta = 0$) is scale-invariant, that is, $d_{IS}(\gamma x|\gamma y) = d_{IS}(x|y)$ [5]. The IS divergence is the only one in the $\beta$-divergence family to possess this property. It implies that time-frequency areas of low power are as important in the divergence computation as the areas of high power. This property is interesting in audio source separation as low-power frequency bands can perceptually contribute as much as high-power frequency bands. Note that both the KL and IS divergences are better adapted to audio source separation than the Euclidean distance, as they are built on a logarithmic scale, like human perception; see [1] and [5]. Moreover, the $\beta$-divergence $D_{\beta}(V|WH)$ is only convex with respect to $W$ (or $H$) if $\beta \in [1, 2]$. Otherwise, the objective function is non-convex. This implies that, for $\beta \notin [1, 2]$, even the problem of inferring $H$ with $W$ fixed is non-convex. For more details on $\beta$-divergences, see [5].
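The homogeneity and scale-invariance properties stated above are easy to check numerically. The following sketch uses the standard definition of the scalar $\beta$-divergence from the NMF literature; the function name `beta_divergence` and the sample values are illustrative choices, not taken from the paper.

```python
import math

def beta_divergence(x, y, beta):
    """Scalar beta-divergence d_beta(x|y) for x > 0 and y > 0."""
    if beta == 0:                                   # Itakura-Saito divergence
        return x / y - math.log(x / y) - 1.0
    if beta == 1:                                   # Kullback-Leibler divergence
        return x * math.log(x / y) - x + y
    return (x**beta + (beta - 1) * y**beta
            - beta * x * y**(beta - 1)) / (beta * (beta - 1))

x, y, gamma = 2.0, 5.0, 10.0                        # arbitrary positive test values
for beta in (0, 1, 2):
    scaled = beta_divergence(gamma * x, gamma * y, beta)
    expected = gamma**beta * beta_divergence(x, y, beta)
    print(beta, scaled, expected)                   # the two values agree for each beta
```

For `beta = 0` the scaled and unscaled values coincide, which is the scale invariance of the IS divergence; for `beta = 2` the function returns half the squared difference.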
I-C Contribution and outline of the paper
In Section II, we propose a new NMF model, referred to as minimum-volume $\beta$-NMF (min-vol $\beta$-NMF), to tackle the audio source separation problem. This model penalizes the columns of the dictionary matrix $W$ so that their convex hull has a small volume. To the best of our knowledge, this model is novel in two aspects: (1) it is the first time a minimum-volume penalty is associated with a $\beta$-divergence for $\beta \neq 2$, and it is the first time such models are used in the context of audio source separation, and (2) as opposed to most previously proposed minimum-volume NMF models, our model imposes a normalization constraint on the factor $W$ instead of $H$. As far as we know, the only other paper that used a normalization of $W$ is [6], but the authors neither justified this choice compared to the normalization of $H$ (the choice seems arbitrary, motivated by the ‘elimination of the norm indeterminacy’) nor provided theoretical guarantees. In this paper, we explain why the normalization of $W$ is a better choice in practice, and we prove that, under some mild assumptions and in the noiseless case, this model identifies the sources; see Theorem 1. To the best of our knowledge, this is the first result of this type in the audio source separation literature. In Section III, we propose an algorithm to tackle min-vol $\beta$-NMF, focusing on the KL and IS divergences. The algorithm is based on multiplicative updates (MU) that are derived using the standard majorization-minimization framework, and that monotonically decrease the objective function. In Section IV, we present several numerical experiments, comparing min-vol $\beta$-NMF with standard NMF and sparse NMF. The two main conclusions are that (1) min-vol $\beta$-NMF performs consistently better at identifying the sources, and (2) as opposed to NMF and sparse NMF, min-vol $\beta$-NMF is able to detect when the factorization rank is overestimated by automatically setting sources to zero.
II Minimum-volume NMF with $\beta$-divergences
In this section, we present a new separation model based on the minimization of $\beta$-divergences, including a penalty term promoting solutions with minimum volume spanned by the columns of the dictionary matrix $W$. Section II-A recalls the geometric interpretation of NMF, which motivates the use of a minimum-volume penalty on the dictionary $W$. Section II-B discusses the newly proposed normalization compared to previous minimum-volume NMF models, and proves that min-vol $\beta$-NMF provably recovers the true factors under mild conditions and in the noiseless case; see Theorem 1.
II-A Geometry and the min-vol $\beta$-NMF model
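As mentioned earlier, $V \approx WH$ means that each column of $V$ is a linear combination of the columns of $W$ weighted by the components of the corresponding column of $H$; in fact, $v_{j} \approx \sum_{k=1}^{K} W(:,k)H(k,j)$ for $1 \leq j \leq N$, where $v_{j}$ denotes the $j$th column of the data matrix $V$. This gives NMF a nice geometric interpretation: for all $j$,
$$v_{j} \approx W H(:,j) \in \operatorname{cone}(W) = \left\{ W\theta \mid \theta \in \mathbb{R}^{K}_{+} \right\},$$
meaning that the columns of $V$ are contained, up to the approximation error, in the convex cone generated by the columns of $W$; see Figure 1 for an illustration.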
From this interpretation, it follows that, in general, NMF decompositions are not unique because there exist several (often, infinitely many) sets of columns of $W$ that span a convex cone containing the data points; see for example [8] for more details. Hence, NMF is in most cases ill-posed because the optimal solution is not unique. In order to make the solution unique (up to permutation and scaling of the columns of $W$ and the rows of $H$), hence making the problem well-posed and the parameters of the problem identifiable, a key idea is to look for a solution $W$ with minimum volume. Intuitively, we will look for the cone $\operatorname{cone}(W)$ containing the data points and as close as possible to these data points. The use of minimum-volume NMF has led to a new class of NMF methods that outperform existing ones in many applications such as document analysis and blind hyperspectral unmixing; see the recent survey [9]. Note that minimum-volume NMF implicitly encourages the factor $H$ to be sparse: the fact that $W$ has a small volume implies that many data points will be located on the facets of $\operatorname{cone}(W)$, hence $H$ will be sparse.
Hence, in this paper, we consider the following model, referred to as min-vol $\beta$-NMF:
$$\min_{W(:,j) \in \Delta^{F}\ \forall j,\ H \geq 0} D_{\beta}(V|WH) + \lambda \operatorname{vol}(W), \tag{2}$$
where $\Delta^{F} = \left\{ x \in \mathbb{R}^{F}_{+} \,\big|\, \sum_{i=1}^{F} x_{i} = 1 \right\}$ is the unit simplex, $\lambda$ is a penalty parameter and $\operatorname{vol}(W)$ is a function that measures the volume spanned by the columns of $W$. In this paper, we use $\operatorname{vol}(W) = \operatorname{logdet}(W^{\top}W + \delta I)$, where $\delta$ is a small positive constant that prevents $\operatorname{logdet}(W^{\top}W)$ from going to $-\infty$ when $W$ tends to a rank-deficient matrix (that is, when $\operatorname{rank}(W) < K$). The reason for using such a measure is that $\sqrt{\det(W^{\top}W)}/K!$ is the volume of the convex hull of the columns of $W$ and the origin. This measure is one of the most widely used ones, and has been shown to perform very well in practice [10, 11]. Moreover, the criterion $\operatorname{logdet}(W^{\top}W + \delta I)$ is able to distinguish two rank-deficient solutions and favour solutions for $W$ with smaller volume [12]. Finally, as we will illustrate in Section IV, this criterion is able to identify the right number of sources even when $K$ is overestimated, by setting some rank-one factors to zero.
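To make the volume measure concrete, the sketch below evaluates a log-determinant penalty of the form $\operatorname{logdet}(W^{\top}W + \delta I)$ on three dictionaries with columns on the unit simplex. The value `delta = 1.0`, the function name `min_vol_penalty` and the random test matrices are assumptions made for illustration only.

```python
import numpy as np

def min_vol_penalty(W, delta=1.0):
    """logdet(W^T W + delta * I); delta > 0 keeps the value finite."""
    K = W.shape[1]
    _, logdet = np.linalg.slogdet(W.T @ W + delta * np.eye(K))
    return logdet

rng = np.random.default_rng(0)
W = rng.random((6, 3))
W /= W.sum(axis=0)                                   # columns on the unit simplex

# Pull every column halfway toward the mean column: the columns stay on the
# simplex but are closer to one another, so the spanned volume shrinks.
W_close = 0.5 * W + 0.5 * W.mean(axis=1, keepdims=True)

# Rank-deficient dictionary (two identical columns): det(W^T W) = 0, yet the
# penalty is still finite thanks to delta.
W_rank_def = np.column_stack([W[:, 0], W[:, 0], W[:, 1]])

print(min_vol_penalty(W), min_vol_penalty(W_close), min_vol_penalty(W_rank_def))
```

The dictionary with columns closer together receives the smaller penalty, which is the behaviour the regularizer is meant to reward.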
II-B Normalization and identifiability of min-vol $\beta$-NMF
As mentioned above, under some appropriate conditions on $V = WH$, minimum-volume NMF models will provably recover the ground-truth $(W, H)$ that generated $V$, up to permutation and scaling of the rank-one factors. The first identifiability results for minimum-volume NMF models assumed that the entries in each column of $H$ sum to one, that is, that $H^{\top}e = e$, where $e$ is the all-one column vector whose dimension is clear from the context, meaning that $H$ is column stochastic [8, 13]. Under this condition, each column of $V$ lies in the convex hull of the columns of $W$; see Figure 2 for an illustration.
Under the three assumptions that (1) $H$ is column stochastic, (2) $W$ is full column rank, and (3) $H$ satisfies the sufficiently scattered condition, minimizing the volume of $\operatorname{conv}(W)$ such that $V = WH$ recovers the true underlying factors, up to permutation and scaling. Intuitively, the sufficiently scattered condition requires $H$ to be sparse enough so that data points are located on the facets of $\operatorname{conv}(W)$; see Appendix -A for a formal definition. The sufficiently scattered condition makes sense for most audio source data sets as it is reasonable to assume that, for most time points, only a few sources are active, hence $H$ is sparse; see [9] for more details on the sufficiently scattered condition. Note that the sufficiently scattered condition is a generalization of the separability condition, which requires $W = V(:, \mathcal{K})$ for some index set $\mathcal{K}$ of size $K$ [14]. However, separability is a much stronger assumption as it requires that, for each source, there exists a time point where only that source is active. Note that although min-vol NMF guarantees identifiability, the corresponding optimization problem (2) is still hard to solve in general, as for the original NMF problem [15].
Despite this nice result, the constraint $H^{\top}e = e$ makes the NMF model less general and does not apply to all data sets. In the case where the data does not naturally belong to a convex hull, one needs to normalize the data points so that their entries sum to one, so that $H^{\top}e = e$ can be assumed without loss of generality (in the noiseless case). This normalization can sometimes increase the noise and might greatly influence the solution, hence it is usually not recommended in practice; see the discussion in [9].
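In [7], the authors show that identifiability still holds when the condition that $H$ is column stochastic is relaxed to $H$ being row stochastic. As opposed to column stochasticity, row stochasticity of $H$ can be assumed without loss of generality since any factorization $WH$ can be properly normalized so that this assumption holds. In fact, $W(:,k)H(k,:) = \big(\alpha W(:,k)\big)\big(\alpha^{-1}H(k,:)\big)$ for any $\alpha > 0$, for $k = 1, \dots, K$. In other terms, letting $D$ be the diagonal matrix with $D(k,k) = \sum_{j=1}^{N} H(k,j)$ for $k = 1, \dots, K$, we have $WH = (WD)(D^{-1}H) = W'H'$, where $H' = D^{-1}H$ is row stochastic.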
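The rescaling argument can be checked directly: moving a positive diagonal scaling between the two factors leaves their product unchanged. The sketch below shows both the row-stochastic normalization of $H$ described above and the analogous column normalization of $W$ that matches the simplex constraint of model (2). The random matrices and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((5, 3))
H = rng.random((3, 8))

# Row-stochastic H: D is diagonal and holds the sums of the rows of H.
d = H.sum(axis=1)
W_row, H_row = W * d, H / d[:, None]          # (W D) and (D^{-1} H)

# Column-stochastic W: scale by the sums of the columns of W instead.
c = W.sum(axis=0)
W_col, H_col = W / c, H * c[:, None]

print(np.allclose(W_row @ H_row, W @ H), np.allclose(W_col @ H_col, W @ H))
```

Both normalizations describe the same product $WH$; they differ only in which factor carries the scale of each source, which matters once a volume penalty on $W$ is added.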
Similarly to [7], we prove in this paper that requiring $W$ to be column stochastic (which can also be assumed without loss of generality) also leads to identifiability. Geometrically, the columns of $W$ are constrained to be on the unit simplex. Minimizing the volume still makes a lot of sense: we want the columns of $W$ to be as close as possible to one another within the unit simplex. In Appendix -A, we prove the following theorem.
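Theorem 1.
Assume $V = W^{\#}H^{\#}$ with $\operatorname{rank}(V) = K$, $W^{\#} \geq 0$ and $H^{\#}$ satisfies the sufficiently scattered condition (Definition 2 in Appendix -A). Then the optimal solution of
$$\min_{W \in \mathbb{R}^{F \times K},\, H \in \mathbb{R}^{K \times N}} \operatorname{logdet}(W^{\top}W) \quad \text{such that} \quad V = WH,\ W^{\top}e = e,\ H \geq 0, \tag{3}$$
recovers $(W^{\#}, H^{\#})$ up to permutation and scaling.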
Proof.
See Appendix -A. ∎
In noiseless conditions, replacing $W^{\top}e = e$ with $He = e$ in (3) leads to the same identifiability result; see [7, Theorem 1]. Therefore, in noiseless conditions and under the conditions of Theorem 1, both models return the same solution up to permutation and scaling. However, in the presence of noise, we have observed that the two models may behave very differently. We advocate that the constraint $W^{\top}e = e$ is better suited for noisy real-world problems: on many numerical examples, we have observed that the normalization $W^{\top}e = e$ is much less sensitive to noise and returns much better solutions. The reason is mostly twofold:
(i) As described above, using the normalization $He = e$ amounts to multiplying $W$ by a diagonal matrix whose entries are the $\ell_{1}$ norms of the rows of $H$. Therefore, the columns of $W$ that correspond to dominating (resp. dominated) sources, that is, sources with much more (resp. less) power and/or active at many (resp. few) time points, will have much higher (resp. lower) norm. Therefore, the term $\operatorname{logdet}(W^{\top}W + \delta I)$ is much more influenced by the dominating sources and will have difficulties penalizing the dominated sources. In other terms, the use of the term $\operatorname{logdet}(W^{\top}W + \delta I)$ with the normalization $He = e$ implicitly requires that the rank-one factors $W(:,k)H(k,:)$ for $k = 1, \dots, K$ are well balanced, that is, have similar norms. This is not the case for many real (audio) signals.
(ii) As will be explained in Section III, the update of $W$ needs the computation of the matrix $Y$, which is the inverse of $W^{\top}W + \delta I$; this term appears in the gradient with respect to $W$ of the objective function. The numerical stability of such operations is related to the condition number of $W^{\top}W + \delta I$. For a normalization on the columns of $W$ (that is, $W^{\top}e = e$), the condition number is bounded above as follows: $\operatorname{cond}(W^{\top}W + \delta I) = \frac{\sigma_{\max}^{2}(W) + \delta}{\sigma_{\min}^{2}(W) + \delta} \leq 1 + \frac{K}{\delta}$, where $\sigma_{\min}(W)$ and $\sigma_{\max}(W)$ are the smallest and largest singular values of $W$, respectively. In the numerical experiments, we use a fixed value of $\delta$. On the other hand, the normalization $He = e$ may lead to arbitrarily large values for the condition number of $W^{\top}W + \delta I$, which we have observed numerically on several examples. This issue can be mitigated with the use of the normalization $He = \rho e$ for some sufficiently large $\rho > 0$, for which identifiability still holds [7]. However, it still performs worse because of the first reason explained above.
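The conditioning argument in (ii) can be illustrated numerically. For a nonnegative $W$ with columns summing to one, each column has Euclidean norm at most one, so the largest eigenvalue of $W^{\top}W$ is at most $K$ and the condition number of $W^{\top}W + \delta I$ is at most $1 + K/\delta$. The sketch below checks this and contrasts it with a rescaled dictionary, as produced by normalizing $H$ when the sources are badly balanced. The value `delta = 1.0` and the row sums of $H$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
F, K, delta = 50, 4, 1.0                 # delta = 1.0 is an assumed value

W = rng.random((F, K))
W /= W.sum(axis=0)                       # column-stochastic W: W^T e = e
# Make two columns almost identical so that W^T W is nearly singular.
W[:, 1] = W[:, 0] + 1e-9 * (W[:, 1] - W[:, 0])

def cond_gram(W, delta):
    """Condition number of W^T W + delta * I."""
    return np.linalg.cond(W.T @ W + delta * np.eye(W.shape[1]))

bound = 1.0 + K / delta
cond_W_normalized = cond_gram(W, delta)

# Same product W H, but normalized on H: W is multiplied by the row sums of H.
row_sums_H = np.array([1e4, 1.0, 1.0, 1e-4])   # hypothetical unbalanced sources
cond_H_normalized = cond_gram(W * row_sums_H, delta)

print(cond_W_normalized <= bound, cond_H_normalized)
```

With the normalization on $W$ the condition number stays below the bound even for a nearly rank-deficient dictionary, while the rescaled dictionary can be orders of magnitude worse conditioned.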
For these reasons, we believe that the model (3) would also be better suited (compared to the normalization on $H$) in other contexts, for example for document classification [16].
III Algorithm for min-vol $\beta$-NMF
Most NMF algorithms alternately update $H$ for $W$ fixed and vice versa, and we adopt this strategy in this paper. For $W$ fixed, (2) is equivalent to standard NMF and we will use the MU that have already been derived in the literature [3, 5].
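For reference, the classical multiplicative update of $H$ for KL-NMF with $W$ fixed can be sketched as follows. This is the well-known Lee-Seung form, $H \leftarrow H \odot \frac{[W^{\top}([V]/[WH])]}{[W^{\top}J_{F,N}]}$, written under the assumption that all entries of $V$, $W$ and $H$ are strictly positive (no safeguard against division by zero is included). The function names and the random data are illustrative.

```python
import numpy as np

def kl_div(V, V_hat):
    """KL divergence D(V | V_hat) summed over all entries (positive inputs)."""
    return float(np.sum(V * np.log(V / V_hat) - V + V_hat))

def update_H_kl(V, W, H):
    """One multiplicative update of H for KL-NMF with W fixed."""
    numer = W.T @ (V / (W @ H))
    denom = W.sum(axis=0)[:, None]       # W^T times the all-one F-by-N matrix
    return H * numer / denom

rng = np.random.default_rng(3)
V = rng.random((20, 30)) + 0.1
W = rng.random((20, 4)) + 0.1
H = rng.random((4, 30)) + 0.1

errors = [kl_div(V, W @ H)]
for _ in range(20):
    H = update_H_kl(V, W, H)
    errors.append(kl_div(V, W @ H))
print(errors[0], errors[-1])             # the divergence does not increase
```

Because the update multiplies $H$ by a nonnegative factor, nonnegativity of the iterates is preserved automatically.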
To tackle (2) for $H$ fixed, let us consider
$$\min_{W \geq 0} F(W) = D_{\beta}(V|WH) + \lambda \operatorname{logdet}(W^{\top}W + \delta I). \tag{4}$$
Note that, for now, we have discarded the normalization on the columns of $W$. In our algorithm, we will use the update for $W$ obtained by solving (4) as a descent direction along with a line-search procedure to integrate the constraint on $W$. This will ensure that the objective function is non-increasing at each iteration. In the following sections, we derive MU for $W$ that decrease the objective in (4). We follow the standard majorization-minimization framework [17]. First, an auxiliary function, which we denote $\bar{F}$, is constructed so that it majorizes the objective. An auxiliary function for $F$ at point $\tilde{W}$ is defined as follows.
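Definition 1.
The function $\bar{F}(W|\tilde{W}) : \Omega \times \Omega \to \mathbb{R}$ is an auxiliary function for $F(W) : \Omega \to \mathbb{R}$ at $\tilde{W} \in \Omega$ if the conditions $\bar{F}(W|\tilde{W}) \geq F(W)$ for all $W \in \Omega$ and $\bar{F}(\tilde{W}|\tilde{W}) = F(\tilde{W})$ are satisfied.
Then, the optimization of $F$ can be replaced by an iterative process that minimizes $\bar{F}$. More precisely, the new iterate $W^{(i+1)}$ is computed by minimizing exactly the auxiliary function at the previous iterate $W^{(i)}$. This guarantees that $F$ decreases at each iteration.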
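Lemma 1.
Let $W, W^{(i)} \geq 0$, and let $\bar{F}$ be an auxiliary function for $F$ at $W^{(i)}$. Then $F$ is non-increasing under the update $W^{(i+1)} = \operatorname{argmin}_{W \geq 0} \bar{F}(W|W^{(i)})$.
Proof.
In fact, we have by definition that $F(W^{(i)}) = \bar{F}(W^{(i)}|W^{(i)}) \geq \min_{W \geq 0} \bar{F}(W|W^{(i)}) = \bar{F}(W^{(i+1)}|W^{(i)}) \geq F(W^{(i+1)})$. ∎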
The most difficult part in using the majorization-minimization framework is to design an auxiliary function that is easy to optimize. Usually such auxiliary functions are separable (that is, there is no interaction between the variables so that each entry of $W$ can be updated independently) and convex.
III-A Separable auxiliary functions for $\beta$-divergences
For the sake of completeness, we briefly recall the auxiliary function proposed in [5] for the data fitting term. It consists in majorizing the convex part of the $\beta$-divergence using Jensen’s inequality and majorizing the concave part by its tangent (first-order Taylor approximation). We have
$$d_{\beta}(x|y) = \check{d}(x|y) + \hat{d}(x|y) + \bar{d}(x), \tag{5}$$
where $\check{d}$ is a convex function of $y$, $\hat{d}$ is a concave function of $y$ and $\bar{d}$ is constant with respect to $y$; see Table I.
The function $D_{\beta}(V|WH)$ can be written as $\sum_{f=1}^{F} D_{\beta}(v^{f}|w^{f}H)$, where $v^{f}$ and $w^{f}$ are respectively the $f$th row of $V$ and $W$. Therefore we only consider the optimization over one specific row of $W$. To simplify notation, we denote the iterates $w^{(i+1)}$ (next iterate) and $w^{(i)}$ (current iterate) as $w$ and $\tilde{w}$, respectively.
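Lemma 2 ([5]).
Let $\tilde{v} = \tilde{w}H$ and $\tilde{w}$ be such that $\tilde{v}_{n} > 0$ for all $n$ and $\tilde{w}_{k} > 0$ for all $k$. Then the function
$$G(w|\tilde{w}) = \sum_{n} \left[ \sum_{k} \frac{\tilde{w}_{k} h_{kn}}{\tilde{v}_{n}}\, \check{d}\!\left(v_{n} \,\Big|\, \tilde{v}_{n} \frac{w_{k}}{\tilde{w}_{k}}\right) \right] + \sum_{n} \left[ \hat{d}'(v_{n}|\tilde{v}_{n}) \sum_{k} (w_{k} - \tilde{w}_{k}) h_{kn} + \hat{d}(v_{n}|\tilde{v}_{n}) \right] + \sum_{n} \bar{d}(v_{n}) \tag{6}$$
is an auxiliary function for $\sum_{n} d_{\beta}(v_{n}|[wH]_{n})$ at $\tilde{w}$.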
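III-B A separable auxiliary function for the minimum-volume regularizer
The minimum-volume regularizer $\operatorname{logdet}(W^{\top}W + \delta I)$ is a non-convex function. However, it can be upper-bounded using the fact that $\operatorname{logdet}(\cdot)$ is a concave function, so that its first-order Taylor approximation provides an upper bound; see for example [10]. For any positive-definite matrices $A$ and $B \in \mathbb{R}^{K \times K}$, we have:
$$\operatorname{logdet}(B) \leq \operatorname{logdet}(A) + \operatorname{trace}\big(A^{-1}(B - A)\big) = \operatorname{trace}(A^{-1}B) + \operatorname{logdet}(A) - K.$$
This implies that for any $W, Z \in \mathbb{R}^{F \times K}$, we have
$$\operatorname{logdet}(W^{\top}W + \delta I) \leq l(W, Z) = \operatorname{trace}(Y W^{\top}W) + \delta \operatorname{trace}(Y) - \operatorname{logdet}(Y) - K, \tag{7}$$
where $Y = (Z^{\top}Z + \delta I)^{-1}$, with $\delta > 0$. Note that $Z^{\top}Z + \delta I$ is positive definite, hence invertible, and its inverse $Y$ is also positive definite. Finally, $l(W, Z)$ is an auxiliary function for $\operatorname{logdet}(W^{\top}W + \delta I)$ at $Z$. However, it is quadratic and not separable, hence non-trivial to optimize over the nonnegative orthant. The non-constant part of $l(W, Z)$ can be written as $\sum_{f=1}^{F} w^{f} Y (w^{f})^{\top}$, where $w^{f}$ is the $f$th row of $W$. Henceforth we will focus on one particular row vector $w$ with $l(w) = w^{\top} Y w$, which will be further considered as a column vector of size $K$.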
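Lemma 3.
Let $w, \tilde{w} \in \mathbb{R}^{K}_{+}$ be such that $\tilde{w}_{k} > 0$ for all $k$, $Y = Y^{+} - Y^{-}$ with $Y^{+} = \max(Y, 0)$ and $Y^{-} = \max(-Y, 0)$, and $\Phi(\tilde{w})$ be the diagonal matrix $\Phi(\tilde{w}) = \operatorname{Diag}\left(2\,\frac{[Y^{+}\tilde{w} + Y^{-}\tilde{w}]}{[\tilde{w}]}\right)$, where $\frac{[A]}{[B]}$ is the component-wise division between $A$ and $B$, and $\Delta w = w - \tilde{w}$. Then
$$\bar{l}(w|\tilde{w}) = l(\tilde{w}) + \Delta w^{\top} \nabla l(\tilde{w}) + \frac{1}{2} \Delta w^{\top} \Phi(\tilde{w}) \Delta w \tag{8}$$
is a separable auxiliary function for $l(w) = w^{\top} Y w$ at $\tilde{w}$.
Proof.
See Appendix -B. ∎
Remark 1 (Choice of the auxiliary function).
A simpler choice for the auxiliary function would be to replace $\Phi(\tilde{w})$ with $2\lambda_{\max}(Y) I$, where $\lambda_{\max}(Y)$ is the largest eigenvalue of $Y$ (the constant 2 appears because $\nabla^{2} l(w) = 2Y$ while there is a factor $1/2$ in front of $\Phi(\tilde{w})$). However, it would lead to a worse approximation. In particular, if $Y$ is a diagonal matrix (since $Y \succ 0$, these diagonal elements are positive), our choice gives $\bar{l}(w|\tilde{w}) = l(w)$ for any $\tilde{w}$, meaning that the auxiliary function matches perfectly the function $l$. This would not be the case for the choice $2\lambda_{\max}(Y) I$ (unless $Y$ is a scaling of the identity matrix).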
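The majorization property of such a separable quadratic bound can be sanity-checked numerically. The sketch below assumes the quadratic $l(w) = w^{\top}Yw$ with a symmetric positive definite $Y$, split as $Y = Y^{+} - Y^{-}$ into the element-wise positive and negative parts, and a diagonal curvature $2\,[(Y^{+} + Y^{-})\tilde{w}]/[\tilde{w}]$. The function names, the value `delta = 1.0` and the random test points are illustrative assumptions.

```python
import numpy as np

def quad(w, Y):
    """l(w) = w^T Y w."""
    return float(w @ Y @ w)

def separable_majorizer(w, w_tilde, Y):
    """Separable quadratic upper bound of l at w_tilde (requires w_tilde > 0)."""
    Y_pos, Y_neg = np.maximum(Y, 0.0), np.maximum(-Y, 0.0)
    phi = 2.0 * ((Y_pos + Y_neg) @ w_tilde) / w_tilde   # diagonal of Phi(w_tilde)
    dw = w - w_tilde
    grad = 2.0 * Y @ w_tilde                            # gradient of l at w_tilde
    return float(quad(w_tilde, Y) + dw @ grad + 0.5 * np.sum(phi * dw**2))

rng = np.random.default_rng(4)
K = 5
Z = rng.random((12, K))
Y = np.linalg.inv(Z.T @ Z + 1.0 * np.eye(K))            # delta = 1.0 assumed
w_tilde = rng.random(K) + 0.1

# Gap between the bound and the function at many random nonnegative points.
gaps = [separable_majorizer(w, w_tilde, Y) - quad(w, Y)
        for w in 3.0 * rng.random((1000, K))]

# Diagonal Y: the bound coincides with l everywhere, as noted in the remark.
Y_diag = np.diag(np.arange(1.0, K + 1))
w_test = rng.random(K)
diag_gap = separable_majorizer(w_test, w_tilde, Y_diag) - quad(w_test, Y_diag)
print(min(gaps), diag_gap)
```

The smallest gap is nonnegative up to rounding, the bound is tight at `w_tilde`, and the gap vanishes for a diagonal `Y`.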
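III-C Auxiliary function for min-vol $\beta$-NMF
Based on the auxiliary functions presented in Sections III-A and III-B, we can directly derive a separable auxiliary function for min-vol $\beta$-NMF (2).
Corollary 1.
For $w, \tilde{w} \in \mathbb{R}^{K}_{+}$, $\lambda \in \mathbb{R}_{+}$, $Y = (\tilde{W}^{\top}\tilde{W} + \delta I)^{-1}$ with $\delta > 0$ and the constant $c$ gathering the terms of (7) that do not depend on $w$, the function
$$\bar{F}(w|\tilde{w}) = G(w|\tilde{w}) + \lambda\, \bar{l}(w|\tilde{w}) + \lambda c,$$
where $G$ is given by (6) and $\bar{l}$ by (8), is a convex and separable auxiliary function for $F$, viewed as a function of the row $w$, at $\tilde{w}$.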
Proof.
This follows directly from Lemma 2, Equation (7) and Lemma 3. ∎
In the following section, we provide explicitly the MU for the KL divergence ($\beta = 1$) by finding a closed-form solution for the minimization of $\bar{F}$. In Appendix -C, we provide the MU for the IS divergence ($\beta = 0$). Due to the lack of space, the other cases are not treated explicitly but can be handled in a similar way. For the same reason, we will only compare KL NMF models in the numerical experiments (Section IV).
III-D Algorithm for min-vol KL-NMF
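As before, let us focus on a single row of $W$, denoted $w$, as the objective function $\bar{F}$ is separable by row. For $\beta = 1$, the derivative of the auxiliary function $\bar{F}(w|\tilde{w})$ with respect to a specific coefficient $w_{k}$ is given by $\nabla_{w_{k}} \bar{F}(w|\tilde{w}) = \sum_{n} h_{kn} - \sum_{n} h_{kn} \frac{\tilde{w}_{k} v_{n}}{w_{k} \tilde{v}_{n}} + 2\lambda [Y\tilde{w}]_{k} + 2\lambda \frac{[(Y^{+} + Y^{-})\tilde{w}]_{k}}{\tilde{w}_{k}} (w_{k} - \tilde{w}_{k})$. Due to the separability, we set the derivative to zero to obtain the closed-form solution, which is given in Table II in matrix form.
Note that although the closed-form solution has a negative term in the numerator of the multiplicative factor (see Table II), the updates always remain nonnegative given that $W$ and $H$ are nonnegative. In fact, the term before the minus sign is always larger than the term after the minus sign: $J_{F,N}H^{\top} - 4\lambda (WY^{-})$ is squared (component-wise) and added to a positive term, hence the component-wise square root of that result is larger than $J_{F,N}H^{\top} - 4\lambda (WY^{-})$.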
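As an illustration of how such a closed form can be obtained, the sketch below solves, entry by entry, the quadratic equation that results from setting to zero the derivative of a KL majorizer plus $\lambda$ times a separable quadratic bound of the log-determinant penalty. It is derived here under explicit assumptions and is not copied from the authors' Table II or MATLAB code: $Y = (W^{\top}W + \delta I)^{-1}$ is evaluated at the current iterate and split as $Y = Y^{+} - Y^{-}$, all entries of $V$, $W$, $H$ are strictly positive, $\lambda > 0$, and the normalization of the columns of $W$ is ignored. The parameter values are arbitrary.

```python
import numpy as np

def objective(V, W, H, lam, delta):
    """D_KL(V | WH) + lam * logdet(W^T W + delta * I)."""
    WH = W @ H
    kl = np.sum(V * np.log(V / WH) - V + WH)
    _, logdet = np.linalg.slogdet(W.T @ W + delta * np.eye(W.shape[1]))
    return float(kl + lam * logdet)

def update_W_minvol_kl(V, W, H, lam, delta):
    """One multiplicative update of W (H fixed), unnormalized sketch."""
    K = W.shape[1]
    Y = np.linalg.inv(W.T @ W + delta * np.eye(K))
    Y_pos, Y_neg = np.maximum(Y, 0.0), np.maximum(-Y, 0.0)
    B = H.sum(axis=1)[None, :] - 4.0 * lam * (W @ Y_neg)   # J H^T - 4 lam W Y^-
    A = W @ (Y_pos + Y_neg)                                # curvature term
    C = (V / (W @ H)) @ H.T                                # ([V]/[WH]) H^T
    # Positive root of the per-entry quadratic, written as a multiplicative factor.
    return W * (np.sqrt(B**2 + 8.0 * lam * A * C) - B) / (4.0 * lam * A)

rng = np.random.default_rng(5)
V = rng.random((20, 30)) + 0.1
W = rng.random((20, 4)) + 0.1
H = rng.random((4, 30)) + 0.1
lam, delta = 0.5, 1.0                                      # assumed values

values = [objective(V, W, H, lam, delta)]
for _ in range(30):
    W = update_W_minvol_kl(V, W, H, lam, delta)
    values.append(objective(V, W, H, lam, delta))
print(values[0], values[-1])                               # objective does not increase
```

The square root is always at least as large as `B`, so the multiplicative factor is nonnegative. In a full algorithm this step would be followed by the column normalization of `W` and a line search, which are not shown here.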
Algorithm 1 summarizes our algorithm to tackle (2) for $\beta = 1$, which we refer to as min-vol KL-NMF. Note that the update of $H$ (step 4) is the one from [3]. More importantly, note that we have incorporated a line-search for the update of $W$. In fact, although the MU for $W$ are guaranteed to decrease the objective function, they do not guarantee that $W$ remains normalized, that is, that $\|W(:,k)\|_{1} = 1$ for all $k$. Hence, we normalize $W$ after it is updated (step 10), and we normalize $H$ accordingly so that $WH$ remains unchanged. When this normalization is performed, the $\beta$-divergence part of $F$ is unchanged but the minimum-volume penalty will change, so that the objective function $F$ might increase. In order to guarantee that the objective is non-increasing, we integrate a simple backtracking line-search procedure; see steps 11-16 of Algorithm 1. In summary, our MU provide a descent direction that preserves nonnegativity of the iterates, and we use a projection and a simple backtracking line-search to guarantee the monotonicity of the objective function, as in standard projected gradient descent methods.
It can be verified that the computational complexity of min-vol KL-NMF is asymptotically equivalent to that of the standard MU for $\beta$-NMF, that is, it requires $\mathcal{O}(FNK)$ operations per iteration. Indeed, the main operations are matrix products with a complexity of $\mathcal{O}(FNK)$ and element-wise operations on matrices of size $F \times N$, $F \times K$ or $K \times N$. Note that the inversion of the $K$-by-$K$ matrix $W^{\top}W + \delta I$ requires $\mathcal{O}(K^{3})$ operations, which is dominated by $\mathcal{O}(FNK)$ since $K \leq \min(F, N)$ (in fact, typically $K \ll \min(F, N)$, hence this term is negligible). Therefore, although Algorithm 1 will be slower than the baseline KL-NMF (that is, the standard MU) because of the additional terms to be computed and the line-search, the asymptotic computational cost is the same; see Table IV for runtime comparison.
IV Numerical experiments
In this section we report an experimental comparative study of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18] applied to the spectrograms of two monophonic piano sequences and a synthetic mix of a bass and drums. For the two monophonic piano sequences, the audio signals are real-life signals of standard quality. Note that the sequences are made of pure piano notes; the number $K$ should therefore correspond to the number of notes present in the mixed signals. The comparative study is performed for several values of $K$, with a focus on the case where the factorization rank is overestimated. For all simulations, random initializations are used for $W$ and $H$, and the best results among 5 runs are kept for the comparative study. In all cases, we use a Hamming window of size 1024, and 50% overlap between two frames. Sparse KL-NMF has a similar structure as min-vol KL-NMF, with a penalty parameter for the sparsity-enforcing regularization. To tune these two parameters, we have used the same strategy for both methods: we manually tried a wide range of values and report the best results. The code is available from bit.ly/minvolKLNMF (code written in MATLAB R2017a), and can be used to rerun directly all experiments below. They were run on a laptop computer with Intel Core i7-7500U CPU 2.70GHz × 4 and 32GB memory.
Mary had a little lamb
The first audio sample is the first measure of “Mary had a little lamb”. The sequence is composed of three notes, played all at once. The recorded signal is 4.7 seconds long and downsampled to 16000 Hz, yielding 75200 samples. The STFT of the input signal yields a temporal resolution of 16 ms and a frequency resolution of 31.25 Hz, so that the amplitude spectrogram has $N = 294$ frames and $F = 257$ frequency bins. The musical score is shown in Figure 3.
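The resolutions quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: `stft_resolution` is a hypothetical helper, the sampling rate is inferred as the number of samples divided by the duration, and a one-sided spectrum with a hop of half a window is assumed. The 512-sample window used here is an inference from the reported 257 bins and 31.25 Hz resolution, not a setting stated in the text.

```python
def stft_resolution(fs_hz, window, overlap=0.5):
    """Return hop duration (ms), bin width (Hz) and one-sided bin count."""
    hop = int(window * (1.0 - overlap))  # samples between two frames
    return {
        "hop_ms": 1000.0 * hop / fs_hz,
        "bin_hz": fs_hz / window,
        "n_bins": window // 2 + 1,
    }

# Sampling rate inferred from the reported duration and sample count.
fs = round(75200 / 4.7)

# Assumed window length (inferred from the reported numbers, see lead-in).
res = stft_resolution(fs, window=512)
print(fs, res)
```

Under these assumptions the helper reproduces the 16 ms temporal resolution, the 31.25 Hz frequency resolution and the 257 frequency bins reported for this recording. The frame count depends on the padding convention of the STFT implementation and is therefore not computed here.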
All NMF algorithms were run for 200 iterations, which allowed them to converge. Figure 4 presents the columns of $W$ (dictionary matrix) and the rows of $H$ for baseline KL-NMF and min-vol KL-NMF with $K = 3$.
Figure 5 presents the time-frequency masking coefficients. These coefficients are computed as follows:
$$\text{mask}^{(k)}_{f,n} = \frac{\hat{X}^{(k)}_{f,n}}{\sum_{j=1}^{K} \hat{X}^{(j)}_{f,n}},$$
where $\hat{X}^{(k)} = W(:,k)H(k,:)$ is the estimated source $k$. The masks are nonnegative and sum to one for each pair $(f,n)$. This representation makes it possible to identify visually whether the NMF algorithm was able to separate the sources properly.
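As an illustration of the masking coefficients, here is a minimal sketch in plain Python. It assumes that the mask of source $k$ is the ratio between the rank-one estimate $W(:,k)H(k,:)$ and the sum of these estimates over all sources, which matches the stated property that the masks are nonnegative and sum to one. The function name `soft_masks`, the small constant `eps` and the toy factors are hypothetical and not taken from the authors' MATLAB code.

```python
def soft_masks(W, H, eps=1e-12):
    """masks[k][f][n] = W[f][k]*H[k][n] / sum_j W[f][j]*H[j][n]."""
    F, K, N = len(W), len(H), len(H[0])
    masks = [[[0.0] * N for _ in range(F)] for _ in range(K)]
    for f in range(F):
        for n in range(N):
            # Contribution of each source to the time-frequency bin (f, n).
            parts = [W[f][k] * H[k][n] for k in range(K)]
            total = sum(parts) + eps  # eps guards against empty bins
            for k in range(K):
                masks[k][f][n] = parts[k] / total
    return masks

# Toy nonnegative factors: F = 2 frequency bins, K = 2 sources, N = 3 frames.
W_toy = [[0.9, 0.1],
         [0.2, 0.8]]
H_toy = [[1.0, 0.0, 2.0],
         [0.0, 3.0, 1.0]]
masks = soft_masks(W_toy, H_toy)
column_sums = [[masks[0][f][n] + masks[1][f][n] for n in range(3)]
               for f in range(2)]
print(column_sums)
```

A mask close to one at a given bin means that the corresponding source explains almost all of the energy there, which is what makes the visual inspection of the separation possible.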
All the simulations give a good separation, with similar results for $W$ and $H$. The activations are coherent with the sequences of the notes. However, Figure 5 shows that min-vol KL-NMF and sparse KL-NMF provide a better separation in terms of time-frequency localization compared to the baseline KL-NMF.
We now perform the same experiment but using $K = 7$, which corresponds to the situation where the factorization rank is overestimated. Figure 6 presents the results and Figure 7 presents the time-frequency masking coefficients.
We observe that min-vol KL-NMF is able to extract the three notes correctly and automatically sets three source estimates to zero (more precisely, three rows of $H$ are set to zero, while the corresponding columns of $W$ have entries equal to one another, as $\|W(:,k)\|_1 = 1$ for all $k$), while baseline KL-NMF and sparse KL-NMF split the notes across all the sources. One can observe that a fourth note is identified in all simulations (see the isolated peaks in Figure 7-(b), second row from the top); it corresponds to the noise within the piano just before a specific note is triggered (in particular, the hammer noise). This observation is confirmed by the fact that the amplitude is proportional to the natural strength of the fingers playing the notes. In this scenario, where $K$ is overestimated, min-vol KL-NMF outperforms baseline KL-NMF and sparse KL-NMF.
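The automatic zeroing of components suggests a simple way to read off the number of active sources from a computed activation matrix. The following sketch is an assumption-laden illustration, not part of the authors' code: the function name `active_sources`, the relative threshold `rel_tol` and the toy matrix are all hypothetical.

```python
def active_sources(H, rel_tol=1e-6):
    """Indices of the rows of H whose total activation is not negligible."""
    energies = [sum(row) for row in H]  # rows of H are nonnegative
    peak = max(energies)
    if peak <= 0.0:
        return []
    # A row counts as zeroed when it is tiny relative to the strongest row.
    return [k for k, e in enumerate(energies) if e > rel_tol * peak]

# Toy activation matrix with one row driven to zero.
H_demo = [[0.0, 0.0, 0.0],
          [1.0, 0.0, 2.0],
          [0.0, 0.5, 0.0]]
kept = active_sources(H_demo)
print(kept, len(kept))
```

In practice the zeroed rows are only numerically close to zero, so the threshold has to be chosen with respect to the scale of the data.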
Prelude of Bach
The second audio sample corresponds to the first 30 seconds of “Prelude and Fugue No.1 in C major” by J. S. Bach, played by Glenn Gould (https://www.youtube.com/watch?v=ZlbK5r5mBH4). The audio sample is a sequence of 13 notes. The recorded signal is downsampled to 11025 Hz, yielding 330750 samples. The STFT of the input signal yields a temporal resolution of 46 ms and a frequency resolution of 10.76 Hz, so that the amplitude spectrogram has $N = 647$ frames and $F = 513$ frequency bins. The musical score is presented in Figure 8. All NMF algorithms were run for 300 iterations, which allowed them to converge. Figure 9 presents the results obtained for $W$ and $H$ with a factorization rank $K = 16$, hence overestimated. We observe that min-vol KL-NMF automatically sets three components to zero (marked with the * symbol in Figure 9) while 13 source estimates are determined. The fundamentals (maximum peak frequencies) of the 13 source estimates correspond to the theoretical fundamentals of the 13 notes. Using baseline KL-NMF or sparse KL-NMF leads to the same conclusions as for the first audio sample: these two algorithms generate as many source estimates as imposed by the factorization rank, while min-vol KL-NMF preserves the integrity of the 13 sources. Additionally, the activations are coherent with the sequence of the notes. Figure 10 shows (on a limited time interval) that the estimated sequence follows the sequence defined in the score. Note that thresholding and a permutation of the rows of $H$ were used to improve visibility.
Bass and drums
The third audio signal is a synthetic mix of a bass and drums (http://isse.sourceforge.net/demos.html). The audio signal is downsampled to 16000 Hz, yielding 104821 samples. The STFT of the input signal yields a temporal resolution of 32 ms and a frequency resolution of 15.62 Hz, so that the amplitude spectrogram has $N = 206$ frames and $F = 513$ frequency bins. For this synthetic mix, we have access to the true sources in the form of two audio files. Therefore, we can estimate the quality of the separation with standard metrics, namely the source-to-distortion ratio (SDR), the source-to-interference ratio (SIR) and the sources-to-artifacts ratio (SAR) [19]. They have been computed with the toolbox BSS Eval (http://bass-db.gforge.inria.fr/bss_eval/). The metrics are expressed in dB, and the higher they are, the better the separation. Algorithms min-vol KL-NMF, baseline KL-NMF and sparse KL-NMF have been considered for this comparative study. A factorization rank equal to two is used. Clearly, a rank-one approximation of each source is too simplistic, but the goal is to compare the algorithms and show that min-vol KL-NMF is able to find a better solution even in this simplified context. All NMF algorithms were run for 400 iterations, which allowed them to converge. Table III shows the results.
Except for the SAR metric of the second source (drums), min-vol KL-NMF outperforms baseline KL-NMF and sparse KL-NMF.
Runtime performance
Let us compare the runtime of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18]. The algorithms are compared on three setups taken from the “Mary had a little lamb” and “Prelude of Bach” experiments above:
- Setup 1: sample “Mary had a little lamb” with $K = 3$, 200 iterations.
- Setup 2: sample “Mary had a little lamb” with $K = 7$, 200 iterations.
- Setup 3: “Prelude and Fugue No.1 in C major” with $K = 16$, 300 iterations.
For each test setup, the algorithms are run with the same 20 random initializations of $W$ and $H$. Table IV reports the average and standard deviation of the runtime (in seconds) over these 20 runs. We observe that min-vol KL-NMF (Algorithm 1) is slower, but not significantly so, as expected. In particular, on the larger Setup 3, it is less than three times slower than the standard MU.
V Conclusion and Perspectives
In this paper, we have presented a new NMF model for audio source separation based on the minimization of a cost function that includes a $\beta$-divergence (data fitting term) and a penalty term that promotes solutions with minimum volume. We have proved the identifiability of the model in the exact case, under the sufficiently scattered condition for the activation matrix $H$. We have provided multiplicative updates to tackle this problem and have illustrated the behaviour of the method on real-world audio signals. We highlighted the capacity of the model to deal with the case where $K$ is overestimated, by automatically setting some components to zero while still giving good source estimates.
Further work includes tackling the following questions:
- Under which conditions can we prove the identifiability of min-vol $\beta$-NMF in the presence of noise, and in the rank-deficient case?
- Can we prove that min-vol $\beta$-NMF performs model order selection automatically? Under which conditions? We have observed this behaviour on many examples, but the proof remains elusive.
- Can we design more efficient algorithms?
Further work also includes the use of our new model min-vol $\beta$-NMF for other applications and the design of more efficient algorithms (for example, algorithms that avoid using a line-search procedure) with stronger convergence guarantees (beyond the monotonicity of the objective function).
Acknowledgments
We thank Kejun Huang and Xiao Fu for helpful discussion on Theorem 1, and giving us the insight to adapt their proof from [7] to our model (2). We also thank the reviewers for their insightful comments that helped us improve the paper.
Appendix A Sufficiently scattered condition and identifiability
Before giving the definition of the sufficiently scattered condition from [8], let us first recall an important property of the duals of nested cones.
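Lemma 4.
Let $\mathcal{C}_1$ and $\mathcal{C}_2$ be convex cones such that $\mathcal{C}_1 \subseteq \mathcal{C}_2$. Then $\mathcal{C}_2^* \subseteq \mathcal{C}_1^*$, where $\mathcal{C}_1^*$ and $\mathcal{C}_2^*$ are respectively the dual cones of $\mathcal{C}_1$ and $\mathcal{C}_2$. The dual of a cone $\mathcal{C}$ is defined as $\mathcal{C}^* = \{y \mid x^T y \geq 0 \text{ for all } x \in \mathcal{C}\}$.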
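Definition 2.
(Sufficiently Scattered) A matrix $H \in \mathbb{R}^{K \times N}_+$ is sufficiently scattered if
1. $\mathcal{C} \subseteq \text{cone}(H)$, and
2. $\text{cone}(H)^* \cap \text{bd}\,\mathcal{C}^* = \{\lambda e_k \mid \lambda \geq 0,\ k = 1, \dots, K\}$,
where $\mathcal{C} = \{x \in \mathbb{R}^K_+ \mid e^T x \geq \sqrt{K-1}\,\|x\|_2\}$ is a second-order cone, $\mathcal{C}^* = \{x \in \mathbb{R}^K \mid e^T x \geq \|x\|_2\}$, $\text{cone}(H) = \{x \mid x = H\theta,\ \theta \geq 0\}$ is the conic hull of the columns of $H$, and bd denotes the boundary of a set.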
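To make the two cones in this definition more concrete, the sketch below tests membership numerically. It assumes the standard definitions used in [7] and [8], namely $\mathcal{C} = \{x \mid e^T x \geq \sqrt{K-1}\,\|x\|_2\}$ and $\mathcal{C}^* = \{x \mid e^T x \geq \|x\|_2\}$. The function names and the tolerance are hypothetical.

```python
import math

def in_cone_C(x, tol=1e-12):
    """Membership in C = {x : e^T x >= sqrt(K-1) * ||x||_2}."""
    K = len(x)
    norm = math.sqrt(sum(v * v for v in x))
    return sum(x) >= math.sqrt(K - 1) * norm - tol

def in_dual_cone(x, tol=1e-12):
    """Membership in C* = {x : e^T x >= ||x||_2}."""
    norm = math.sqrt(sum(v * v for v in x))
    return sum(x) >= norm - tol

ones = [1.0, 1.0, 1.0]       # the axis of both cones
unit = [1.0, 0.0, 0.0]       # a unit vector e_k
print(in_cone_C(ones), in_cone_C(unit))       # e is in C, e_k is not (K = 3)
print(in_dual_cone(unit), in_dual_cone([1.0, -1.0, 0.0]))
```

For a unit vector $e_k$ the inequality defining $\mathcal{C}^*$ holds with equality, so $e_k$ lies on the boundary of $\mathcal{C}^*$, which is the set appearing in the second condition.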
We can now prove Theorem 1.
Proof of Theorem 1.
Recall that and are the true latent factors that generated , with and is sufficiently scattered. Let us consider and a feasible solution of (3). Since and , we must have . Hence there exists an invertible matrix such that and . Since is a feasible solution of problem (3), we have
[TABLE]
where we assumed without loss of generality since and . Note that is equivalent to . This means that matrix is column stochastic. Therefore we have that . Since is a feasible solution, we also have . Let us denote by the th row of $A$, and by the th column of . By the definition of a dual cone, means that the rows for . Since is sufficiently scattered, (by Lemma 4) hence . Therefore we have by definition of . This leads to the following: . The first inequality is the Hadamard inequality, the second inequality is due to , the third inequality is the arithmetic-geometric mean inequality. Now we can conclude exactly as is done in [7, Theorem 1] by showing that matrix can only be a permutation matrix for an optimal solution (,) of (3), and therefore identifiability for model (3) holds. ∎
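The three inequalities named in the proof (Hadamard, dual cone membership, arithmetic-geometric mean) can be chained as follows. This is a sketch in assumed notation rather than a transcription of the original display: $A$ is the invertible $K$-by-$K$ matrix of the proof, $a_j$ denotes its $j$th row, $e$ is the all-one vector, and the two facts used are $e^T A = e^T$ and $a_j \in \mathcal{C}^*$, that is, $\|a_j\|_2 \leq a_j e$.

```latex
\[
|\det(A)|
\;\leq\; \prod_{j=1}^{K} \|a_j\|_2
\;\leq\; \prod_{j=1}^{K} a_j e
\;\leq\; \left( \frac{1}{K} \sum_{j=1}^{K} a_j e \right)^{K}
\;=\; \left( \frac{e^T A e}{K} \right)^{K}
\;=\; \left( \frac{e^T e}{K} \right)^{K}
\;=\; 1 .
\]
```

The last two equalities use that the sum of the row sums of $A$ equals $e^T A e$, which is $e^T e = K$ when the columns of $A$ sum to one.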
Appendix B Proof of Lemma 3
Separability of the auxiliary function holds since $\Phi(\tilde{w})$ is diagonal. The condition from Definition 1 can be checked easily. It remains to prove that the auxiliary function upper-bounds $l(w)$ for all $w$. Let us first rewrite the quadratic function $l$ using its Taylor expansion at $\tilde{w}$:
$$l(w) = l(\tilde{w}) + (w-\tilde{w})^{T} \nabla l(\tilde{w}) + \frac{1}{2}(w-\tilde{w})^{T} \nabla^{2} l(\tilde{w}) (w-\tilde{w}) = l(\tilde{w}) + (w-\tilde{w})^{T} 2Y\tilde{w} + \frac{1}{2}(w-\tilde{w})^{T} 2Y (w-\tilde{w}).$$
Proving this upper bound is equivalent to proving that
$$\frac{1}{2}(w-\tilde{w})^{T} \left[\Phi(\tilde{w}) - 2Y\right] (w-\tilde{w}) \geq 0,$$
which boils down to proving that the matrix $\left[\Phi(\tilde{w}) - 2Y\right]$ is positive semi-definite. We have
$$\Phi_{ij}(\tilde{w}) = 2\delta_{ij} \frac{(Y^{+}\tilde{w})_{i} + (Y^{-}\tilde{w})_{i}}{\tilde{w}_{i}},$$
where $\delta_{ij}$ is the Kronecker symbol. Let us consider the following matrix:
$$M_{ij}(\tilde{w}) = \tilde{w}_{i} \left[\Phi(\tilde{w}) - 2Y\right]_{ij} \tilde{w}_{j},$$
which is a rescaling of $\left[\Phi(\tilde{w}) - 2Y\right]$. It remains to show that $M(\tilde{w})$ is positive semi-definite (the remainder of the proof was suggested to us by one of the reviewers; it is more elegant and simpler than our original proof). Since $M(\tilde{w})$ is symmetric and its diagonal entries are nonnegative, it is sufficient to show that $M(\tilde{w})$ is diagonally dominant [Horn and Johnson, Matrix Analysis, 1985, Proposition 7.2.3], that is,
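$$|M_{ii}(\tilde{w})| \geq \sum_{j \neq i} |M_{ij}(\tilde{w})| \quad \text{for all } i.$$
We have for all $i$ that
$$M_{ii}(\tilde{w}) - \sum_{j \neq i} |M_{ij}(\tilde{w})| = 2\tilde{w}_{i} \sum_{j \neq i} \left(Y^{+}_{ij} + Y^{-}_{ij} - |Y_{ij}|\right) \tilde{w}_{j} + 2\tilde{w}_{i}^{2} \left(Y^{+}_{ii} + Y^{-}_{ii} - Y_{ii}\right).$$
Since $\tilde{w} \geq 0$ and $|Y_{ij}| \leq Y^{+}_{ij} + Y^{-}_{ij}$ for all $i, j$ (as $Y = Y^{+} - Y^{-}$ with $Y^{+} \geq 0$ and $Y^{-} \geq 0$), we have
$$M_{ii}(\tilde{w}) - \sum_{j \neq i} |M_{ij}(\tilde{w})| \geq 0,$$
implying that $M(\tilde{w})$ is diagonally dominant.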
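The diagonal dominance argument can be sanity-checked numerically. The sketch below builds the rescaled matrix $M_{ij}(\tilde{w}) = \tilde{w}_i[\Phi(\tilde{w}) - 2Y]_{ij}\tilde{w}_j$ for a random symmetric $Y$ and a random positive $\tilde{w}$. It assumes the split $Y^{+} = \max(Y, 0)$ and $Y^{-} = \max(-Y, 0)$, so that $(Y^{+}\tilde{w})_i + (Y^{-}\tilde{w})_i = \sum_j |Y_{ij}|\tilde{w}_j$. The helper names and the random test data are hypothetical.

```python
import random

def rescaled_matrix(Y, w):
    """M_ij = w_i * [Phi(w) - 2Y]_ij * w_j, with Phi(w) diagonal."""
    n = len(w)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # (Y+ w)_i + (Y- w)_i under the assumed max(., 0) split of Y.
        abs_row = sum(abs(Y[i][j]) * w[j] for j in range(n))
        phi_ii = 2.0 * abs_row / w[i]
        for j in range(n):
            phi_ij = phi_ii if i == j else 0.0
            M[i][j] = w[i] * (phi_ij - 2.0 * Y[i][j]) * w[j]
    return M

def is_diagonally_dominant(M, tol=1e-9):
    """Check M_ii >= sum_{j != i} |M_ij| for every row, up to tol."""
    n = len(M)
    return all(
        M[i][i] + tol >= sum(abs(M[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )

random.seed(0)
n = 5
Y = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        Y[i][j] = Y[j][i] = random.uniform(-1.0, 1.0)  # symmetric, mixed signs
w = [random.uniform(0.1, 2.0) for _ in range(n)]

M = rescaled_matrix(Y, w)
dominant = is_diagonally_dominant(M)
print(dominant)
```

Such a check does not replace the proof, but it is a quick way to catch a sign error when implementing $\Phi(\tilde{w})$.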
Appendix C Algorithm for min-vol IS-NMF
For $\beta = 0$ (IS divergence), the derivative of the auxiliary function with respect to a specific coefficient is given by:
[TABLE]
Let
[TABLE]
Setting the derivative to zero requires computing the roots of a degree-three polynomial in this coefficient. We used the procedure developed in [20], which is based on the explicit calculation of the intermediary root of a canonical form of the cubic. This procedure is able to provide highly accurate numerical results even for badly conditioned polynomials. The algorithm for min-vol IS-NMF follows the same steps as for min-vol KL-NMF: only the two steps corresponding to the updates of $H$ and $W$ have to be modified. For the update of $H$ (step 4), use the standard MU. For the update of $W$ (step 9), use:
for $f = 1$ to $F$
    for $k = 1$ to $K$
        Compute the roots of the degree-three polynomial
        Pick, among these roots and zero, the value that minimizes the objective
    end for
end for
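The inner step of this loop (compute the roots of a cubic, then keep the best candidate among the roots and zero) can be sketched as follows. Several things here are assumptions: the root finder uses the textbook Cardano and trigonometric formulas rather than the more robust procedure of [20], only positive roots are kept as candidates, and the cubic and objective in the usage example are toy stand-ins, since the actual polynomial coefficients are not reproduced in this section. All function names are hypothetical.

```python
import math

def real_cubic_roots(a, b, c, d):
    """Real roots of a*x^3 + b*x^2 + c*x + d = 0 (a != 0), depressed-cubic form."""
    p = (3.0 * a * c - b * b) / (3.0 * a * a)
    q = (2.0 * b ** 3 - 9.0 * a * b * c + 27.0 * a * a * d) / (27.0 * a ** 3)
    shift = -b / (3.0 * a)  # x = t + shift maps depressed roots back
    disc = (q / 2.0) ** 2 + (p / 3.0) ** 3
    if disc > 0.0:  # a single real root (Cardano)
        s = math.sqrt(disc)
        u, v = -q / 2.0 + s, -q / 2.0 - s
        cbrt = lambda z: math.copysign(abs(z) ** (1.0 / 3.0), z)
        return [cbrt(u) + cbrt(v) + shift]
    if p == 0.0:  # disc <= 0 with p == 0 forces q == 0: triple root
        return [shift]
    # Three real roots, counted with multiplicity (trigonometric form).
    m = 2.0 * math.sqrt(-p / 3.0)
    arg = max(-1.0, min(1.0, 3.0 * q / (p * m)))  # clamp rounding errors
    theta = math.acos(arg) / 3.0
    return [m * math.cos(theta - 2.0 * math.pi * k / 3.0) + shift
            for k in range(3)]

def best_candidate(roots, objective):
    """Among zero and the positive roots, return the minimizer of objective."""
    candidates = [0.0] + [r for r in roots if r > 0.0]
    return min(candidates, key=objective)

# Toy example: g'(x) = (x - 1)(x - 2)(x - 4) = x^3 - 7x^2 + 14x - 8.
g = lambda x: x ** 4 / 4.0 - 7.0 * x ** 3 / 3.0 + 7.0 * x ** 2 - 8.0 * x
roots = sorted(real_cubic_roots(1.0, -7.0, 14.0, -8.0))
best = best_candidate(roots, g)
print(roots, best)
```

Evaluating the objective at every candidate, rather than trusting a single root, is what guards against picking a local maximum or a negative root of the cubic.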
- [1] A. Lefèvre, “Méthode d’apprentissage de dictionnaire pour la séparation de sources audio avec un seul capteur,” Ph.D. dissertation, Ecole Normale Supérieure de Cachan, 2012.
- [2] P. Magron, “Reconstruction de phase par modèles de signaux : application à la séparation de sources audio,” Ph.D. dissertation, TELECOM Paris Tech, 2016.
- [3] D. Lee and H. Seung, “Algorithms for non-negative matrix factorization,” in NIPS’00 Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS. MIT Press Cambridge, 2000, pp. 535–541.
- [4] C. Févotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis,” Neural computation, vol. 21, no. 3, pp. 793–830, 2009.
- [5] C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the $\beta$-divergence,” Neural computation, vol. 23, no. 9, pp. 2421–2456, 2011.
- [6] G. Zhou, S. Xie, Z. Yang, J.-M. Yang, and Z. He, “Minimum-volume-constrained nonnegative matrix factorization: Enhanced ability of learning parts,” IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1626–1637, 2011.
- [7] X. Fu, K. Huang, and N. D. Sidiropoulos, “On identifiability of nonnegative matrix factorization,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 328–332, 2018.
- [8] K. Huang, N. Sidiropoulos, and A. Swami, “Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition,” IEEE Transactions on Signal Processing, vol. 62, no. 1, pp. 211–224, 2014.
