Generalized Separable Nonnegative Matrix Factorization

Junjun Pan; Nicolas Gillis

arXiv:1905.12995·cs.LG·April 14, 2021


TL;DR

This paper introduces a generalized separability condition for nonnegative matrix factorization, allowing more flexible basis selection, and proposes efficient algorithms to solve this new formulation with applications in image, text, and spectral data analysis.

Contribution

It generalizes the separability assumption in NMF, which broadens its applicability, and develops convex and heuristic algorithms for the resulting problem.

Findings

01

The proposed methods outperform existing algorithms on synthetic, document, and image datasets.

02

The convex optimization model is efficiently solved using a fast gradient method.

03

The heuristic algorithm shows competitive results in practical scenarios.

Abstract

Nonnegative matrix factorization (NMF) is a linear dimensionality reduction technique for nonnegative data with applications such as image analysis, text mining, audio source separation and hyperspectral unmixing. Given a data matrix $M$ and a factorization rank $r$, NMF looks for a nonnegative matrix $W$ with $r$ columns and a nonnegative matrix $H$ with $r$ rows such that $M \approx W H$. NMF is NP-hard to solve in general. However, it can be computed efficiently under the separability assumption which requires that the basis vectors appear as data points, that is, that there exists an index set $K$ such that $W = M (:, K)$. In this paper, we generalize the separability assumption: We only require that for each rank-one factor $W (:, k) H (k, :)$ for $k = 1, 2, \dots, r$, either $W (:, k) = M (:, j)$ for some $j$ or $H (k, :) = M (i, :)$ for some $i$. We refer to the corresponding problem as…

Tables (4)

Table 1: The relative approximation quality in percent for the document data sets. Among GS-NMF and separable NMF algorithms, the highest quality is highlighted in bold, the second highest is underlined. The last line reports the average computational time in seconds for the different algorithms.

Dataset	r	$(r_{2}, r_{1})$	GSPA	$(r_{2}, r_{1})$	GS-FGM	SPA*	SPA-C	SPA-R	NMF	MV-NMF	SVD
NG10	10	(7,3)	91.61	(8,2)	91.64	91.35	91.44	91.49	92.41	92.37	92.46
TDT30	30	(7,23)	14.13	(4,26)	14.38	14.03	14.47	11.30	17.69	17.49	18.48
classic	4	(4,0)	3.58	(4,0)	3.58	3.58	3.58	1.48	5.12	3.40	5.20
reviews	5	(0,5)	8.39	(0,5)	8.39	8.39	8.39	7.69	13.25	13.06	13.48
sports	7	(0,7)	10.49	(1,6)	10.65	10.65	10.49	5.98	13.36	13.21	13.76
ohscal	10	(0,10)	10.27	(0,10)	10.27	10.27	10.27	7.03	11.23	11.14	11.49
k1b	6	(1,5)	5.76	(1,5)	7.07	5.62	5.78	4.54	9.42	9.16	9.62
la12	6	(0,6)	4.79	(0,6)	4.79	4.79	4.79	3.02	7.52	5.96	7.78
hitech	6	(3,3)	6.43	(3,3)	6.43	4.50	5.77	4.86	8.84	8.08	8.99
la1	6	(1,5)	5.05	(2,4)	5.13	4.51	5.11	3.73	7.83	6.50	8.03
la2	6	(0,6)	5.86	(0,6)	5.86	5.86	5.86	3.90	8.11	7.73	8.35
tr41	10	(5,5)	52.30	(7,3)	54.90	53.31	53.12	56.03	57.19	56.66	57.74
tr45	10	(5,5)	69.08	(7,3)	71.94	68.20	68.37	69.55	76.22	76.18	76.36
tr11	9	(6,3)	74.21	(7,2)	74.27	72.14	72.50	74.74	76.33	76.28	76.44
tr23	6	(1,5)	63.66	(5,1)	70.70	68.46	65.04	71.32	72.73	72.69	72.86
time			0.091		1.249	0.013	0.008	0.011	78.44	280.41	0.030

Table 2: Words and documents extracted by GS-FGM on the TDT30 data set.

Documents (4)
“India won’t hesitate to deploy nuclear weapons, premier indicates”
“For media, unsavory story tests ideals and stretches limits”
“Algeria rebuffs European concerns on reported atrocities”
“Cohen promises ‘significant’ military campaign against Iraq”
Words (26)
hindu, economic, pakistan, iraq, spkr, pope, tobacco, starr, suharto,
kaczynski, percent, white, school, jones, correspondent, clinton,
companies, jordan, american, winter, lewinsky, oil, hong, hockey,
annan, president

Table 3: The relative approximation quality in percent for the facial image data sets. Among GS-NMF and separable NMF algorithms, the highest quality is highlighted in bold, the second highest is underlined.

Dataset	r	$(r_{2}, r_{1})$	GSPA	$(r_{2}, r_{1})$	GS-FGM	SPA*	SPA-C	SPA-R	NMF	MV-NMF	SVD
CBCL	49	(1,48)	80.73	(14,35)	83.10	82.29	79.44	84.57	90.51	90.50	91.40
Frey	50	(24,26)	82.46	(39,11)	83.89	83.43	80.61	83.78	90.40	90.41	91.51
Yale	38	(13,25)	57.52	(24,14)	68.26	61.10	60.24	62.94	76.94	76.85	79.23
ORL	40	(20,20)	81.38	(28,12)	82.54	82.32	82.23	83.26	89.47	89.49	90.24

Table 4: Computational time in seconds for the different algorithms on the image data sets.

Dataset	r	$(r_{2}, r_{1})$	GSPA	$(r_{2}, r_{1})$	GS-FGM	SPA*	SPA-C	SPA-R	NMF	MV-NMF	SVD
CBCL	49	(1,48)	1.612	(14,35)	5.076	0.075	0.070	0.128	17.52	307.63	0.605
Frey	50	(24,26)	2.616	(39,11)	5.387	0.096	0.070	0.090	23.94	392.83	0.603
Yale	38	(13,25)	1.387	(24,14)	35.746	0.053	0.048	0.044	12.10	422.49	0.318
ORL	40	(20,20)	0.470	(28,12)	0.905	0.017	0.013	0.013	4.62	159.73	0.051

Equations (102)

M\Pi=M\Pi\left(\begin{array}{cc}I_{r}&H^{\prime}\\ 0_{n-r,r}&0_{n-r,n-r}\\ \end{array}\right),

M\;=\;M\;\underbrace{\Pi\left(\begin{array}{cc}I_{r}&H^{\prime}\\ 0_{n-r,r}&0_{n-r,n-r}\\ \end{array}\right)\Pi^{T}}_{X\in\mathbb{R}^{n\times n}}.

M = M (:, K_{1}) P_{1} + P_{2} M (K_{2}, :),

M (K_{2}, K_{1}) = M (K_{2}, K_{1}) + M (K_{2}, K_{1}),

M=\Pi_{r}\left(\begin{array}{cc}W_{1}&W_{1}H_{1}+W_{2}H_{2}\\ 0_{r_{2},r_{1}}&H_{2}\\ \end{array}\right)\Pi_{c},

M = M X + Y M,

X

Y

M=\Pi_{r}\left(\begin{array}{cc}W_{1}&W_{1}H_{1}+W_{2}H_{2}\\ 0_{r_{2},r_{1}}&H_{2}\\ \end{array}\right)\Pi_{c},

\tilde{M}\left(\begin{array}{cc}I_{r_{1}}&H_{1}\\ 0_{n-r_{1},r_{1}}&0_{n-r_{1},n-r_{1}}\\ \end{array}\right)+\left(\begin{array}{cc}0_{m-r_{2},m-r_{2}}&W_{2}\\ 0_{r_{2},m-r_{2}}&I_{r_{2}}\\ \end{array}\right)\tilde{M}.

\min_{X \in R_{+}^{n \times n}, Y \in R_{+}^{m \times m}}

such that M = M X + Y M,

\| X^{*} \|_{row, 0} + \| Y^{*} \|_{col, 0} = | K_{1} | + | K_{2} | = r_{1} + r_{2} .

M_{n}=\left(\begin{array}{ccccc}1&0&\frac{1}{2}&0&x^{T}\\ 0&1&0&\frac{1}{2}&y^{T}\\ 0&0&\frac{1}{2}&\frac{1}{2}&z^{T}\\ \end{array}\right)

M_{n} = M_{n} (:, 1 : 2) P_{1} + P_{2} M_{n} (3, :),

M=\left(\begin{array}{cc}0_{3,3}&M_{n}\\ M_{m}^{T}&0_{m,n}\end{array}\right).

D_{1} M D_{2} = D_{1} M (:, K_{1}) P_{1} D_{2} + D_{1} P_{2} M (K_{2}, :) D_{2} .

\tilde{M} = \tilde{M} (:, K_{1}) \tilde{P}_{1} + \tilde{P}_{2} \tilde{M} (K_{2}, :),

M=\left(\begin{array}{ccc}M_{11}&0_{m-r_{4},r_{1}-r_{3}}&M_{13}\\ 0_{r_{4}-r_{2},r_{3}}&M_{22}&M_{23}\\ 0_{r_{2},r_{3}}&0_{r_{2},r_{1}-r_{3}}&M_{33}\\ \end{array}\right),

M_{13} = M_{11} X_{1} + Y_{1} M_{33} \in R_{+}^{(m - r_{4}) \times (n - r_{1})}, and

M_{23} = M_{22} X_{2} + Y_{2} M_{33} \in R_{+}^{(r_{4} - r_{2}) \times (n - r_{1})},

\left(\begin{array}{c}M_{13}\\ M_{23}\\ \end{array}\right)

M_{11}\left(\begin{array}{cc}0&X_{1}\\ \end{array}\right)+\left(\begin{array}{cc}0&Y_{1}\\ \end{array}\right)\left(\begin{array}{cc}M_{22}&M_{23}\\ 0_{r_{2},r_{1}-r_{3}}&M_{33}\\ \end{array}\right).

\left(\begin{array}{ccc}1&0&2\\ 0&1&2\\ 0&0&1\end{array}\right).

\min_{X \in R_{+}^{n \times n}, Y \in R_{+}^{m \times m}}

such that \| M - M X - Y M \| \leq \epsilon,

\min_{X \in R_{+}^{n \times n}, Y \in R_{+}^{m \times m}}

\| M - M X - Y M \| \leq \epsilon,

X (i, j) \leq X (i, i) \leq 1 for 1 \leq i, j \leq n,

and Y (l, t) \leq Y (t, t) \leq 1 for 1 \leq l, t \leq m .

M=\left(\begin{array}{cc}W_{1}&W_{1}H_{1}+W_{2}H_{2}\\ 0_{r_{2},r_{1}}&H_{2}\\ \end{array}\right).

k_{1} = e^{T} M (:, j) \geq e^{T} W_{1} H_{1} (:, j - r_{1}) = k_{1} e^{T} H_{1} (:, j - r_{1}),


Full text

Generalized Separable Nonnegative Matrix Factorization

Junjun Pan Nicolas Gillis

Department of Mathematics and Operational Research

Faculté Polytechnique, Université de Mons

Rue de Houdain 9, 7000 Mons, Belgium

Emails: {junjun.pan, nicolas.gillis}@umons.ac.be. This work was supported by the European Research Council (ERC starting grant no 679515), and the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47.

Abstract

Nonnegative matrix factorization (NMF) is a linear dimensionality reduction technique for nonnegative data with applications such as image analysis, text mining, audio source separation and hyperspectral unmixing. Given a data matrix $M$ and a factorization rank $r$, NMF looks for a nonnegative matrix $W$ with $r$ columns and a nonnegative matrix $H$ with $r$ rows such that $M\approx WH$. NMF is NP-hard to solve in general. However, it can be computed efficiently under the separability assumption which requires that the basis vectors appear as data points, that is, that there exists an index set $\mathcal{K}$ such that $W=M(:,\mathcal{K})$. In this paper, we generalize the separability assumption: We only require that for each rank-one factor $W(:,k)H(k,:)$ for $k=1,2,\dots,r$, either $W(:,k)=M(:,j)$ for some $j$ or $H(k,:)=M(i,:)$ for some $i$. We refer to the corresponding problem as generalized separable NMF (GS-NMF). We discuss some properties of GS-NMF and propose a convex optimization model which we solve using a fast gradient method. We also propose a heuristic algorithm inspired by the successive projection algorithm. To verify the effectiveness of our methods, we compare them with several state-of-the-art separable NMF algorithms on synthetic, document and image data sets.

Keywords. nonnegative matrix factorization, separability, algorithms

1 Introduction

Given a nonnegative matrix $M\in\mathbb{R}^{m\times n}_{+}$ and an integer factorization rank $r$, nonnegative matrix factorization (NMF) is the problem of computing $W\in\mathbb{R}^{m\times r}_{+}$ and $H\in\mathbb{R}^{r\times n}_{+}$ such that $M\approx WH$. Typically, the columns of the input matrix $M$ correspond to data points (such as images of pixel intensities or documents of word counts), and NMF performs linear dimensionality reduction. In fact, we have $M(:,j)\approx\sum_{k=1}^{r}W(:,k)H(k,j)$ for all $j$, where $M(:,j)$ denotes the $j$th column of $M$. This means that the data points are approximated by points within an $r$-dimensional subspace spanned by the columns of $W$. The nonnegativity constraints lead to easily interpretable factors with applications such as image processing, text mining, hyperspectral unmixing and audio source separation; see for example the recent survey [14] and the references therein.

NMF is NP-hard in general [47] and its solution is in most cases not unique; see [14] and the references therein. These two issues motivated the introduction of the separability assumption as a way to solve NMF efficiently and have unique solutions. A matrix $M\in\mathbb{R}^{m\times n}$ is $r$ -separable if there exist an index set $\mathcal{K}$ of cardinality $r$ and a nonnegative matrix $H$ such that $M=M(:,\mathcal{K})H$ , where $M(:,\mathcal{K})$ is the matrix containing the columns of $M$ with index in $\mathcal{K}$ . This means that there exists an NMF $(W,H)$ such that each column of $W$ is equal to a column of $M$ . Given a matrix $M$ that satisfies the separability condition, computing $W=M(:,\mathcal{K})$ and $H$ can be done efficiently; see for example [33, 18] and the references therein. The corresponding problem is referred to as separable NMF.
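As an illustration of how the column index set $\mathcal{K}$ can be found in the noiseless separable case, here is a minimal sketch of the successive projection algorithm (SPA), which this paper mentions later as the inspiration for its heuristic. The sketch assumes that $W$ has full column rank and that the columns of $H$ sum to at most one. The function name `spa` and the synthetic data are illustrative choices, not taken from the paper or from [33, 18].

```python
import numpy as np

def spa(M, r):
    """Greedy column selection: pick the residual column of largest norm,
    then project all columns onto its orthogonal complement."""
    R = np.array(M, dtype=float)          # residual matrix (a copy of M)
    K = []
    for _ in range(r):
        j = int(np.argmax(np.sum(R * R, axis=0)))   # largest squared norm
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)        # remove the component along u
        K.append(j)
    return K

# Synthetic r-separable matrix (illustrative data).
rng = np.random.default_rng(0)
m, n, r = 8, 20, 3
W = rng.random((m, r))                           # full column rank almost surely
Hp = rng.dirichlet(np.ones(r), size=n - r).T     # columns sum to one
perm = rng.permutation(n)
M = np.hstack([W, W @ Hp])[:, perm]              # hide W among the columns

K = spa(M, r)
# Positions of the columns of W inside the permuted matrix.
true_K = sorted(int(np.where(perm == k)[0][0]) for k in range(r))
print(sorted(K), true_K)
```

Once `K` is known, $H$ can be obtained by solving a nonnegative least squares problem for each column of $M$, which is omitted here.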

Let us present an equivalent definition of separability that will be particularly useful in this paper. It was originally proposed in [11, 42, 10] in order to design convex formulations for separable NMF. The matrix $M$ is $r$ -separable if there exist some permutation matrix $\Pi\in\{0,1\}^{n\times n}$ and a nonnegative matrix $H^{\prime}\in\mathbb{R}^{r\times(n-r)}_{+}$ such that

$$M\Pi=M\Pi\left(\begin{array}{cc}I_{r}&H^{\prime}\\ 0_{n-r,r}&0_{n-r,n-r}\end{array}\right),\tag{1}$$

where $I_{r}$ is the $r$-by-$r$ identity matrix and $0_{r,p}$ is the matrix of all zeros of dimension $r$ by $p$. In fact, under an appropriate permutation, the first $r$ columns of $M$ correspond to the columns of $W$ while the last $n-r$ columns are nonnegative linear (conic) combinations of these first $r$ columns. Equivalently, we have

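$$M\;=\;M\;\underbrace{\Pi\left(\begin{array}{cc}I_{r}&H^{\prime}\\ 0_{n-r,r}&0_{n-r,n-r}\end{array}\right)\Pi^{T}}_{X\in\mathbb{R}^{n\times n}}.$$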

Convex formulations were obtained by trying to find a matrix $X$ such that (i) $M\approx MX$ and (ii) $X$ has as many zero rows as possible [11, 42, 10, 21]; see Section 3 for more details.
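The self-representation $M=MX$ can be checked numerically. The sketch below builds a random $r$-separable matrix, forms $X=\Pi\left[I_{r}\ H^{\prime};\ 0\ 0\right]\Pi^{T}$, and confirms that $X$ has exactly $r$ nonzero rows whose indices are the positions of the columns of $W$ in $M$. The sizes and the random data are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 5, 8, 3

W = rng.random((m, r))            # basis columns
Hp = rng.random((r, n - r))       # nonnegative weights H'
p = rng.permutation(n)
Pi = np.eye(n)[:, p]              # permutation matrix: M @ Pi == M[:, p]

# M Pi = [W, W H'], hence M = [W, W H'] Pi^T.
M = np.hstack([W, W @ Hp]) @ Pi.T

# X = Pi [[I_r, H'], [0, 0]] Pi^T
B = np.zeros((n, n))
B[:r, :r] = np.eye(r)
B[:r, r:] = Hp
X = Pi @ B @ Pi.T

nonzero_rows = np.flatnonzero(np.abs(X).sum(axis=1) > 0)
print(np.allclose(M, M @ X), sorted(nonzero_rows), sorted(p[:r]))
```

The nonzero rows of `X` identify which columns of `M` serve as the basis, which is why convex formulations look for an `X` with as many zero rows as possible.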

Note that every $m$-by-$n$ nonnegative matrix is $n$-separable since $M=MI_{n}$; hence it is important to find the minimal $r$. Geometrically, in noiseless conditions, the minimal $r$ is the number of extreme rays of the cone generated by the columns of $M$.

Under the separability assumption, NMF can be solved in polynomial time. This has been known and used for a long time in the hyperspectral imaging and signal processing communities [43, 37, 8, 7]; see also [33] and the references therein. Furthermore, separable NMF can still be solved in polynomial time in the presence of noise [3, 4], and many robust algorithms have been proposed recently [42, 2, 16, 23, 17]. The separability assumption makes sense in several practical situations. In hyperspectral unmixing, each column of the data matrix is the spectral signature of a pixel. Separability requires that for each material within the hyperspectral image, there exists a pixel that contains only that material; see for example [33]. In audio source separation, the input matrix is the time-frequency amplitude spectrogram [12]. Separability requires that for each source, there exists a moment in time when only that source is active (or, considering the matrix transpose, that for each source, there is a frequency for which only that source is active).

In document classification, each entry $M(i,j)$ of the matrix $M$ indicates the importance of word $i$ in document $j$ (e.g., the number of occurrences of word $i$ in document $j$). Separability of $M$ (that is, each column of $W$ appears as a column of $M$) requires that, for each topic, there exists at least one document that discusses only that topic (a “pure” document). Separability of $M^{T}$ (that is, each row of $H$ appears as a row of $M$) requires that, for each topic, there exists at least one word used only in that topic [5] (a “pure” word, referred to as an anchor word).

In this paper, we generalize the separability assumption as follows.

Definition 1.

A matrix $M\in\mathbb{R}_{+}^{m\times n}$ is $(r_{1},r_{2})$ -separable if there exist an index set $\mathcal{K}_{1}$ of cardinality $r_{1}$ and an index set $\mathcal{K}_{2}$ of cardinality $r_{2}$ , and nonnegative matrices $P_{1}\in\mathbb{R}_{+}^{r_{1}\times n}$ and $P_{2}\in\mathbb{R}_{+}^{m\times r_{2}}$ such that

$$M=M(:,\mathcal{K}_{1})P_{1}+P_{2}M(\mathcal{K}_{2},:),\tag{2}$$

where $P_{1}(:,\mathcal{K}_{1})=I_{r_{1}}$ and $P_{2}(\mathcal{K}_{2},:)=I_{r_{2}}$.

We will refer to such matrices as generalized separable (GS) matrices, with their corresponding GS decomposition (2).

The $(r_{1},r_{2})$ -separability is a natural extension of $r$ -separability since a matrix is $r$ -separable if and only if it is $(r,0)$ -separable. Note that every $m$ -by- $n$ nonnegative matrix $M$ is $(n,0)$ - and $(0,m)$ -separable.

Generalized separability makes sense in several applications. In document classification, it requires that for each topic, there exists either

• a document discussing only that topic (a “pure” document), or

• a word used only by that topic (an anchor word).

This is a much more relaxed condition than separability which needs a “pure” document for each topic, or an anchor word for each topic when considering the matrix transpose.

In a GS decomposition, the $r_{1}$ columns of $M$ indexed by $\mathcal{K}_{1}$ are $r_{1}$ pure documents, and the $r_{2}$ rows of $M$ indexed by $\mathcal{K}_{2}$ are $r_{2}$ anchor words, for a total of $r=r_{1}+r_{2}$ topics. The conditions $P_{1}(:,\mathcal{K}_{1})=I_{r_{1}}$ and $P_{2}(\mathcal{K}_{2},:)=I_{r_{2}}$ are rather natural and mean that the pure documents and anchor words are represented by themselves.

In terms of audio source separation [12], generalized separability requires that for each source there exists either

• a moment in time where only that source is active, or

• a frequency where only that source has a positive signature.

Again, this condition is much more relaxed than separability.

1.1 Related problems

GS-NMF is related to the CUR decomposition and the pseudo-skeleton approximation. Given a matrix $M$, these techniques try to identify a subset of columns $\mathcal{K}_{1}$ and rows $\mathcal{K}_{2}$ of $M$ such that $\|M-M(:,\mathcal{K}_{1})UM(\mathcal{K}_{2},:)\|_{F}$ is as small as possible. In the CUR decomposition, $U$ is chosen so as to minimize the approximation error, that is, $U=M(:,\mathcal{K}_{1})^{\dagger}MM(\mathcal{K}_{2},:)^{\dagger}$, where $A^{\dagger}$ denotes the Moore-Penrose generalized inverse of the matrix $A$ [34]. In the skeleton approximation, $U=M(\mathcal{K}_{2},\mathcal{K}_{1})^{-1}$ [24]. We refer the reader to [36] and the references therein for more information on these models. Since these models do not take nonnegativity into account, their analysis is rather different from that of GS-NMF. For example, to obtain exact decompositions of a rank-$r$ matrix, they can pick any subset of $r$ linearly independent rows and columns; this is not true for GS-NMF.
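The two choices of $U$ can be compared on a small example. The sketch below uses a random nonnegative rank-$r$ matrix and arbitrary index sets (both are assumptions for illustration) and checks that the CUR and skeleton formulas reproduce $M$ exactly, up to rounding, when the selected rows and columns are linearly independent.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 6, 7, 3
M = rng.random((m, r)) @ rng.random((r, n))   # nonnegative matrix of rank r

K1 = [0, 2, 5]        # hypothetical column subset
K2 = [1, 3, 4]        # hypothetical row subset
C, R = M[:, K1], M[K2, :]

# CUR: U minimizes ||M - C U R||_F.
U_cur = np.linalg.pinv(C) @ M @ np.linalg.pinv(R)
# Skeleton: U is the inverse of the intersection block.
U_skel = np.linalg.inv(M[np.ix_(K2, K1)])

err_cur = np.linalg.norm(M - C @ U_cur @ R)
err_skel = np.linalg.norm(M - C @ U_skel @ R)
print(err_cur, err_skel, (U_cur < 0).any())
```

The middle factor `U_cur` is free to contain negative entries, which is the sense in which these models ignore nonnegativity.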

GS-NMF is also related to a model introduced in [32] and referred to as latent low-rank representation (LatLRR). The goal of LatLRR is to use the representation $M\approx MX+YM$ where both $X$ and $Y$ have low rank. Intuitively, $M$ is represented using a subspace of the column space of $M$ (namely, $MX$) and a subspace of the row space (namely, $YM$). To achieve this goal, Liu and Yan [32] minimize the nuclear norms of $X$ and $Y$ (since minimizing the rank is hard in general) and apply their model on facial images. GS-NMF clearly shares some similarity with LatLRR. In fact, our convex model introduced in Section 3 will also use the representation $M=MX+YM$ but the constraints on $X$ and $Y$ will be rather different. Moreover, GS-NMF takes nonnegativity into account, and it is more interpretable: the basis used to reconstruct $M$ cannot consist of arbitrary linear combinations of the columns/rows of $M$ as in LatLRR, but must be a subset of these columns/rows. For example, when applied on facial images (see Section 5.3 for numerical experiments), our model will identify important pixels and images within a data set (meaning that they can be used to approximate well all other images) while LatLRR identifies important linear combinations of pixels and images, which is more difficult to interpret.

1.2 Outline and contribution of the paper

In this paper, we consider the NMF problem under the generalized separability condition, referred to as generalized separable NMF (GS-NMF).

In Section 2, we provide an equivalent characterization of GS matrices, similar to the one given for separable matrices in (1). This leads to an idealized model to tackle GS-NMF. We then present several properties of GS matrices. We present a class of $m$-by-$n$ matrices which are not $(n-1,0)$- nor $(0,m-1)$-separable but that are $(3,3)$-separable. This illustrates the fact that GS decompositions can be much more compact than separable ones, requiring far fewer rank-one factors to reconstruct the input matrix. We also discuss non-uniqueness issues of GS-NMF, a problem which is not present for separable NMF. In Section 3, we propose a convex optimization model to tackle GS-NMF. It is the generalization of the models proposed in [42, 21] for separable NMF. Then, we implement a fast gradient method to solve this model, as was done in [22] for separable NMF. Unfortunately, this model requires the use of $n^{2}+m^{2}$ variables, hence it is computationally rather expensive and does not scale well for large data sets. In Section 4, we propose a heuristic algorithm inspired by the successive projection algorithm (SPA) [1, 23] which we refer to as the generalized successive projection algorithm (GSPA) and that requires $O(mnr)$ operations, as for most NMF algorithms.

In Section 5, we perform extensive numerical experiments on synthetic, document and image data sets. In most cases, we will observe that GS-NMF algorithms are able to compute decompositions with the same number of rank-one factors but with a lower approximation error than separable NMF algorithms.

2 Properties of GS Matrices

Let us first show a simple property.

Property 1 (Pattern of zeros).

Let $M\in\mathbb{R}^{m\times n}_{+}$ be $(r_{1},r_{2})$ -separable as described in Definition 1 so that $M=M(:,\mathcal{K}_{1})P_{1}+P_{2}M(\mathcal{K}_{2},:)$ . Then $M(\mathcal{K}_{2},\mathcal{K}_{1})=0_{r_{2},r_{1}}$ .

Proof.

According to (2), we have

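$$M(\mathcal{K}_{2},\mathcal{K}_{1})=M(\mathcal{K}_{2},\mathcal{K}_{1})P_{1}(:,\mathcal{K}_{1})+P_{2}(\mathcal{K}_{2},:)M(\mathcal{K}_{2},\mathcal{K}_{1})=M(\mathcal{K}_{2},\mathcal{K}_{1})+M(\mathcal{K}_{2},\mathcal{K}_{1}),$$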

since, by definition, $P_{1}(:,\mathcal{K}_{1})=I_{r_{1}}$ and $P_{2}(\mathcal{K}_{2},:)=I_{r_{2}}$ . This implies that $M(\mathcal{K}_{2},\mathcal{K}_{1})=0$ . ∎

Intuitively, in terms of topic modeling for example, Property 1 means that a pure document about a topic cannot contain an anchor word from another topic.

Let us provide two equivalent characterizations of GS matrices.

Property 2 (Equivalent characterization 1).

A matrix $M\in\mathbb{R}^{m\times n}_{+}$ is $(r_{1},r_{2})$ -separable if and only if it can be written as

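$$M=\Pi_{r}\left(\begin{array}{cc}W_{1}&W_{1}H_{1}+W_{2}H_{2}\\ 0_{r_{2},r_{1}}&H_{2}\end{array}\right)\Pi_{c},$$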

for some permutation matrices $\Pi_{c}\in\{0,1\}^{n\times n}$ and $\Pi_{r}\in\{0,1\}^{m\times m}$, and for some nonnegative matrices $W_{1}\in\mathbb{R}^{(m-r_{2})\times r_{1}}_{+}$, $H_{1}\in\mathbb{R}^{r_{1}\times(n-r_{1})}_{+}$, $W_{2}\in\mathbb{R}^{(m-r_{2})\times r_{2}}_{+}$ and $H_{2}\in\mathbb{R}^{r_{2}\times(n-r_{1})}_{+}$.

Proof.

This follows directly from Property 1 and Definition 1. The permutation $\Pi_{c}$ is chosen such that it moves the columns of $M$ corresponding to $\mathcal{K}_{1}$ to the first $r_{1}$ positions, and the permutation $\Pi_{r}$ is chosen such that it moves the rows of $M$ corresponding to $\mathcal{K}_{2}$ to the last $r_{2}$ positions. After these permutations, there is an $r_{2}$-by-$r_{1}$ block of zeros at the bottom left of $\Pi_{r}^{T}M\Pi_{c}^{T}$ since $M(\mathcal{K}_{2},\mathcal{K}_{1})=0$ (Property 1). Moreover, since $M=M(:,\mathcal{K}_{1})P_{1}+P_{2}M(\mathcal{K}_{2},:)$ for some nonnegative matrices $P_{1}$ and $P_{2}$, we can take $W_{1},H_{1},W_{2}$ and $H_{2}$ such that $M(:,\mathcal{K}_{1})=\Pi_{r}[W_{1};0_{r_{2},r_{1}}]$, $P_{1}=[I_{r_{1}}\,H_{1}]\Pi_{c}$, $M(\mathcal{K}_{2},:)=[0_{r_{2},r_{1}}\,H_{2}]\Pi_{c}$, and $P_{2}=\Pi_{r}[W_{2};I_{r_{2}}]$. ∎
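The block form of Property 2 gives a direct recipe for generating GS matrices. The following sketch draws random nonnegative blocks, applies random permutations, and checks Definition 1 and Property 1 numerically. The sizes, the seed and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r1, r2 = 6, 7, 2, 2

W1 = rng.random((m - r2, r1))
H1 = rng.random((r1, n - r1))
W2 = rng.random((m - r2, r2))
H2 = rng.random((r2, n - r1))

# Block form of Property 2 before applying the permutations.
Mt = np.block([[W1, W1 @ H1 + W2 @ H2],
               [np.zeros((r2, r1)), H2]])

pr, pc = rng.permutation(m), rng.permutation(n)
Pr = np.eye(m)[:, pr]      # row k of Mt goes to row pr[k] of M
Pc = np.eye(n)[pc, :]      # column k of Mt goes to column pc[k] of M
M = Pr @ Mt @ Pc

K1 = pc[:r1]               # selected columns
K2 = pr[m - r2:]           # selected rows
P1 = np.hstack([np.eye(r1), H1]) @ Pc
P2 = Pr @ np.vstack([W2, np.eye(r2)])

print(np.allclose(M, M[:, K1] @ P1 + P2 @ M[K2, :]))   # decomposition (2)
print(np.allclose(M[np.ix_(K2, K1)], 0))               # Property 1
```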

As explained in the introduction, a matrix $M$ is $r$ -separable if and only if it can be written as $M=MX$ where $X$ has $r$ non-zero rows. A similar characterization is possible for GS matrices.

Property 3 (Equivalent characterization 2).

A matrix $M\in\mathbb{R}^{m\times n}_{+}$ is $(r_{1},r_{2})$ -separable if and only if it can be written as

$$M=MX+YM,$$

where

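$$X=\Pi_{c}^{T}\left(\begin{array}{cc}I_{r_{1}}&H_{1}\\ 0_{n-r_{1},r_{1}}&0_{n-r_{1},n-r_{1}}\end{array}\right)\Pi_{c}\quad\text{and}\quad Y=\Pi_{r}\left(\begin{array}{cc}0_{m-r_{2},m-r_{2}}&W_{2}\\ 0_{r_{2},m-r_{2}}&I_{r_{2}}\end{array}\right)\Pi_{r}^{T},\tag{9}$$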

for some permutation matrices $\Pi_{c}\in\{0,1\}^{n\times n}$ and $\Pi_{r}\in\{0,1\}^{m\times m}$, and for some $H_{1}\in\mathbb{R}^{r_{1}\times(n-r_{1})}_{+}$ and $W_{2}\in\mathbb{R}^{(m-r_{2})\times r_{2}}_{+}$.

Proof.

By Property 2, the matrix $M$ is $(r_{1},r_{2})$ -separable if and only if there exist some permutation matrices $\Pi_{c}\in\{0,1\}^{n\times n}$ and $\Pi_{r}\in\{0,1\}^{m\times m}$ such that

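$$M=\Pi_{r}\left(\begin{array}{cc}W_{1}&W_{1}H_{1}+W_{2}H_{2}\\ 0_{r_{2},r_{1}}&H_{2}\end{array}\right)\Pi_{c},$$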

for some $W_{1}\in\mathbb{R}^{(m-r_{2})\times r_{1}}_{+}$ , $H_{1}\in\mathbb{R}^{r_{1}\times(n-r_{1})}_{+}$ , $W_{2}\in\mathbb{R}^{(m-r_{2})\times r_{2}}_{+}$ , $H_{2}\in\mathbb{R}^{r_{2}\times(n-r_{1})}_{+}$ . Letting $\tilde{M}=\Pi^{T}_{r}M\Pi^{T}_{c}$ hence $M=\Pi_{r}\tilde{M}\Pi_{c}$ , we have that $\tilde{M}$ is equal to

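$$\tilde{M}\left(\begin{array}{cc}I_{r_{1}}&H_{1}\\ 0_{n-r_{1},r_{1}}&0_{n-r_{1},n-r_{1}}\end{array}\right)+\left(\begin{array}{cc}0_{m-r_{2},m-r_{2}}&W_{2}\\ 0_{r_{2},m-r_{2}}&I_{r_{2}}\end{array}\right)\tilde{M}.$$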

∎
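The characterization $M=MX+YM$ can be verified on a random instance. In the sketch below, $X$ and $Y$ are obtained by conjugating the two block matrices from the proof with the permutations, which is what the identity $M=\Pi_{r}\tilde{M}\Pi_{c}$ implies. The sizes and random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r1, r2 = 5, 6, 2, 1

W1 = rng.random((m - r2, r1))
H1 = rng.random((r1, n - r1))
W2 = rng.random((m - r2, r2))
H2 = rng.random((r2, n - r1))
Mt = np.block([[W1, W1 @ H1 + W2 @ H2],
               [np.zeros((r2, r1)), H2]])       # M tilde

Pr = np.eye(m)[:, rng.permutation(m)]
Pc = np.eye(n)[rng.permutation(n), :]
M = Pr @ Mt @ Pc

A = np.zeros((n, n))                 # [[I, H1], [0, 0]]
A[:r1, :r1] = np.eye(r1)
A[:r1, r1:] = H1
B = np.zeros((m, m))                 # [[0, W2], [0, I]]
B[:m - r2, m - r2:] = W2
B[m - r2:, m - r2:] = np.eye(r2)

X = Pc.T @ A @ Pc
Y = Pr @ B @ Pr.T

row0 = int((np.abs(X).sum(axis=1) > 0).sum())   # number of nonzero rows of X
col0 = int((np.abs(Y).sum(axis=0) > 0).sum())   # number of nonzero columns of Y
print(np.allclose(M, M @ X + Y @ M), row0, col0)
```

The counts `row0` and `col0` equal $r_{1}$ and $r_{2}$ here, so the number of selected columns and rows can be read off from the sparsity patterns of `X` and `Y`.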

In practice, given a GS matrix, it is important to decompose it as a $(r_{1},r_{2})$ -separable matrix with minimal value for $r_{1}+r_{2}$ since this compresses the data the most. In the following, minimal $(r_{1},r_{2})$ -separable matrices are defined.

Definition 2.

A matrix $M$ is minimal $(r_{1},r_{2})$-separable if $M$ is $(r_{1},r_{2})$-separable and $M$ is not $(r^{\prime}_{1},r^{\prime}_{2})$-separable for any $(r^{\prime}_{1},r^{\prime}_{2})$ such that $r_{1}^{\prime}+r_{2}^{\prime}<r_{1}+r_{2}$.

By Property 3, finding minimal GS decompositions is equivalent to finding $X$ and $Y$ that satisfy (9) and such that the number of non-zero rows of $X$ and non-zero columns of $Y$ is minimized.

Property 4 (Idealized model).

Let $M$ be minimal $(r_{1},r_{2})$ -separable, and let $(X^{*},Y^{*})$ be an optimal solution of

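$$\min_{X\in\mathbb{R}_{+}^{n\times n},\,Y\in\mathbb{R}_{+}^{m\times m}}\ \|X\|_{row,0}+\|Y\|_{col,0}\quad\text{such that}\quad M=MX+YM,\tag{11}$$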

where $\|X\|_{row,0}$ is equal to the number of nonzero rows of $X$ and $\|Y\|_{col,0}$ is equal to the number of nonzero columns of $Y$ . Let also $\mathcal{K}_{1}$ correspond to the indices of the non-zero rows of $X^{*}$ and $\mathcal{K}_{2}$ to the indices of the non-zero columns of $Y^{*}$ . If $\operatorname{rank}(M)=r_{1}+r_{2}$ , we have

$$\|X^{*}\|_{row,0}+\|Y^{*}\|_{col,0}=|\mathcal{K}_{1}|+|\mathcal{K}_{2}|=r_{1}+r_{2}.$$

Proof.

By Property 3, an $(r_{1},r_{2})$ -separable matrix can be written as $M=MX+YM$ where the number of non-zero rows of $X$ and non-zero columns of $Y$ is $r_{1}+r_{2}$ . Hence, by optimality of $(X^{*},Y^{*})$ , we have $|\mathcal{K}_{1}|+|\mathcal{K}_{2}|\leq r_{1}+r_{2}$ .

Moreover, since $\operatorname{rank}(M)=r_{1}+r_{2}$ , we must have $|\mathcal{K}_{1}|+|\mathcal{K}_{2}|\geq r_{1}+r_{2}$ . ∎

Some remarks are in order:

• As opposed to separable NMF, due to the non-uniqueness of GS-NMF, $|\mathcal{K}_{1}|$ is not necessarily equal to $r_{1}$ and $|\mathcal{K}_{2}|$ to $r_{2}$; see Property 9 below.

• Unfortunately, solving (11) does not guarantee $X$ and $Y$ to have the form (9) where $X$ and $Y$ contain the identity matrix as a submatrix. This is why we need the condition $\operatorname{rank}(M)=r_{1}+r_{2}$.

• Of course, (11) is a difficult combinatorial problem. We will consider in Section 3 a convex relaxation. Before doing that, we first present several other interesting properties of GS matrices.

From a practical point of view, GS matrices will be particularly interesting when they compress the data significantly more than separable matrices. In other words, it would be interesting to know whether there exist $(r_{1},r_{2})$-separable matrices which are not $(r,0)$- nor $(0,r)$-separable for $r\gg r_{1}+r_{2}$. In fact, this is the case; see Property 5 for the case $r_{1}=r_{2}=3$ and $r=\min(m-1,n-1)$. First let us show the following lemma.

Lemma 1.

There exist 3-by- $n$ matrices that are $(2,1)$ -separable but not $(n-1,0)$ -separable.

Proof.

Consider the 3-by- $n$ matrix

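$$M_{n}=\left(\begin{array}{ccccc}1&0&\frac{1}{2}&0&x^{T}\\ 0&1&0&\frac{1}{2}&y^{T}\\ 0&0&\frac{1}{2}&\frac{1}{2}&z^{T}\end{array}\right)\tag{12}$$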

where $x,y,z\in\mathbb{R}^{n-4}$ are such that $(x_{i},y_{i},z_{i})$ for $i=1,2,\dots,n-4$ are defined as $0<x_{i}<\frac{1}{2}$ and $x_{i}\neq x_{j}$ for all $i\neq j$, $y_{i}=2(\frac{1}{2}-x_{i})^{2}$, $z_{i}=1-x_{i}-y_{i}$. The points $(x_{i},y_{i},z_{i})$ are distinct points on a curve on the unit simplex, hence such points cannot be written as conic combinations of any other points on that curve. In fact, since the entries of the vectors $(x_{i},y_{i},z_{i})$ sum to one for $i=1,2,\dots,n-4$, the weights in such a conic combination would also have to sum to one, hence such a conic combination would actually be a convex combination. Clearly, distinct points $(x_{i},y_{i})$ on the strictly convex curve $y=2(\frac{1}{2}-x)^{2}$ are not convex combinations of one another; in other words, every such point is a vertex of their convex hull.

Note also that the third and fourth columns of $M_{n}$ are the two end points of that curve. The first two columns of $M_{n}$ also cannot be written as convex combinations of the other columns: their third entry is zero, while the third entry of every other column is positive (in particular, $z_{i}\neq 0$ for all $i$). This implies that $M_{n}$ is not $(n-1,0)$-separable: every column of $M_{n}$ is an extreme ray of the cone spanned by the columns of $M_{n}$. Moreover, $M_{n}$ is $(2,1)$-separable since $M_{n}(1:2,1:2)=I_{2}$ while the third row can be represented by itself: we have

$$M_{n}=M_{n}(:,1:2)P_{1}+P_{2}M_{n}(3,:),$$

where $P_{1}=M_{n}(1:2,:)$ , $P_{2}=(0,0,1)^{T}$ . ∎
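The construction in this proof is easy to reproduce. The sketch below builds $M_{n}$ for $n=8$ with equally spaced $x_{i}$ (an arbitrary choice satisfying the conditions of the lemma), verifies the $(2,1)$-separable decomposition, and checks that the points on the curve are in strictly convex position. It does not run the full extreme-ray test for every column, which would require solving linear programs. The helper name `make_Mn` is hypothetical.

```python
import numpy as np

def make_Mn(n):
    """3-by-n matrix of Lemma 1 with equally spaced x_i (requires n >= 5)."""
    x = np.linspace(0.0, 0.5, n - 2)[1:-1]     # n-4 distinct values in (0, 1/2)
    y = 2.0 * (0.5 - x) ** 2
    z = 1.0 - x - y
    head = np.array([[1.0, 0.0, 0.5, 0.0],
                     [0.0, 1.0, 0.0, 0.5],
                     [0.0, 0.0, 0.5, 0.5]])
    return np.hstack([head, np.vstack([x, y, z])])

Mn = make_Mn(8)
K1, K2 = [0, 1], [2]                 # first two columns, third row
P1 = Mn[:2, :]
P2 = np.array([[0.0], [0.0], [1.0]])
recon = Mn[:, K1] @ P1 + P2 @ Mn[K2, :]

# Columns 3..n sorted by x: consecutive turns all have the same sign,
# so each point is a vertex of the convex hull of these points.
order = np.argsort(Mn[0, 2:]) + 2
pts = Mn[:2, order].T
d = np.diff(pts, axis=0)
turn = d[:-1, 0] * d[1:, 1] - d[:-1, 1] * d[1:, 0]

print(np.allclose(Mn, recon), (turn > 0).all())
```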

Property 5 (Compression).

There exist $m$ -by- $n$ matrices that are $(3,3)$ -separable but not $(n-1,0)$ - nor $(0,m-1)$ -separable.

Proof.

Let $M_{n}$ be a 3-by- $n$ matrix and $M_{m}$ be a 3-by- $m$ matrix constructed as in (12). Let us also construct the $(m+3)$ -by- $(n+3)$ matrix

$$M=\left(\begin{array}{cc}0_{3,3}&M_{n}\\ M_{m}^{T}&0_{m,n}\end{array}\right).\tag{13}$$

By Lemma 1, $M$ is (3,3)-separable (note that the corresponding GS decomposition is not unique since $M_{n}$ is (2,1)- and (0,3)-separable), while not being ( $n+2$ ,0)-separable nor (0, $m+2$ )-separable. In fact, assume $M$ is (0, $m+2$ )-separable. Observe that any row that would be selected from the first $3$ rows (resp. last $m$ rows) of $M$ cannot be used to reconstruct any of the last $m$ rows (resp. first $3$ rows) of $M$ using a positive weight because of the zeros in the last positions (resp. in the first positions). Hence a (0, $m+2$ )-separable decomposition of $M$ would imply that either $M_{m}^{T}$ is (0, $m-1$ )-separable, a contradiction with Lemma 1, or that $M_{n}$ is (0,2)-separable which is not possible since $\operatorname{rank}(M_{n})=3$ . The same observation holds for the columns, by symmetry of the problem. ∎
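One explicit $(3,3)$-separable decomposition of the block matrix can be written down and checked numerically. The sketch below assumes the same equally spaced construction of $M_{n}$ and $M_{m}$ as an illustration; the index sets and the matrices `P1` and `P2` are one valid choice among several, since the decomposition is not unique.

```python
import numpy as np

def make_Mn(n):
    """3-by-n matrix of Lemma 1 with equally spaced x_i (requires n >= 5)."""
    x = np.linspace(0.0, 0.5, n - 2)[1:-1]
    y = 2.0 * (0.5 - x) ** 2
    z = 1.0 - x - y
    head = np.array([[1.0, 0.0, 0.5, 0.0],
                     [0.0, 1.0, 0.0, 0.5],
                     [0.0, 0.0, 0.5, 0.5]])
    return np.hstack([head, np.vstack([x, y, z])])

m, n = 6, 7
Mn, Mm = make_Mn(n), make_Mn(m)
M = np.block([[np.zeros((3, 3)), Mn],
              [Mm.T, np.zeros((m, n))]])        # (m+3)-by-(n+3)

# Columns: third column of Mm^T and first two columns of Mn.
K1 = [2, 3, 4]
# Rows: third row of Mn and first two rows of Mm^T.
K2 = [2, 3, 4]

P1 = np.zeros((3, n + 3))
P1[0, 2] = 1.0
P1[1:, 3:] = Mn[:2, :]
P2 = np.zeros((m + 3, 3))
P2[2, 0] = 1.0
P2[3:, 1:] = Mm[:2, :].T

recon = M[:, K1] @ P1 + P2 @ M[K2, :]
print(np.allclose(M, recon), np.linalg.matrix_rank(M))
```

Only six rank-one factors are needed, whereas a column-only or row-only selection would need all $n+3$ columns or all $m+3$ rows.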

The next property is rather straightforward but we state it here for completeness. It shows that generalized separability is invariant to scaling.

Property 6 (Scaling).

The matrix $M$ is $(r_{1},r_{2})$ -separable if and only if $D_{1}MD_{2}$ is $(r_{1},r_{2})$ -separable for any diagonal matrices $D_{1}$ and $D_{2}$ whose diagonal elements are positive.

Proof.

Let $M$ be $(r_{1},r_{2})$ -separable with $M=M(:,\mathcal{K}_{1})P_{1}+P_{2}M(\mathcal{K}_{2},:)$ with $|\mathcal{K}_{1}|=r_{1}$ and $|\mathcal{K}_{2}|=r_{2}$ . Multiplying on both sides by $D_{1}$ and $D_{2}$ , we obtain

$$D_{1}MD_{2}=D_{1}M(:,\mathcal{K}_{1})P_{1}D_{2}+D_{1}P_{2}M(\mathcal{K}_{2},:)D_{2}.$$

Denoting $\tilde{M}=D_{1}MD_{2}$ , $\tilde{P}_{1}=D_{2}(\mathcal{K}_{1},\mathcal{K}_{1})^{-1}P_{1}D_{2}$ and $\tilde{P}_{2}=D_{1}P_{2}D_{1}(\mathcal{K}_{2},\mathcal{K}_{2})^{-1}$ , we have

$$\tilde{M}=\tilde{M}(:,\mathcal{K}_{1})\tilde{P}_{1}+\tilde{P}_{2}\tilde{M}(\mathcal{K}_{2},:),$$

where $\tilde{P}_{1}(\mathcal{K}_{1},\mathcal{K}_{1})=I_{r_{1}}$ and $\tilde{P}_{2}(\mathcal{K}_{2},\mathcal{K}_{2})=I_{r_{2}}$ ; hence $\tilde{M}$ is $(r_{1},r_{2})$ -separable. The proof in the other direction is the same since $M=D_{1}^{-1}\tilde{M}D_{2}^{-1}$ is the diagonal scaling of $\tilde{M}$ using the inverses of $D_{1}$ and $D_{2}$ . ∎
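The identity used in this proof can be checked numerically on a toy case. The sketch below is only an illustration: the 3-by-3 matrix, the selected column and row, and the helper names (`outer`, `add`, `scale`) are assumptions chosen for the example and do not come from the paper. It builds a small $(1,1)$ -separable matrix, scales it, and verifies that the rescaled weights reproduce the scaled matrix.

```python
def outer(u, v):
    """Outer product u v^T of two lists."""
    return [[ui * vj for vj in v] for ui in u]

def add(A, B):
    """Entrywise sum of two matrices stored as lists of rows."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(M, d1, d2):
    """Return diag(d1) * M * diag(d2)."""
    return [[d1[i] * M[i][j] * d2[j] for j in range(len(d2))]
            for i in range(len(d1))]

# Toy (1,1)-separable matrix: column 0 and row 2 are selected (0-based indices).
k1, k2 = 0, 2
col = [2.0, 1.0, 0.0]    # M(:, K1); zero in the selected row
row = [0.0, 3.0, 1.0]    # M(K2, :); zero in the selected column
p1 = [1.0, 0.5, 0.25]    # P1, equal to 1 at the selected column
p2 = [0.2, 0.4, 1.0]     # P2, equal to 1 at the selected row
M = add(outer(col, p1), outer(p2, row))

# Positive diagonal scalings (arbitrary values for the illustration).
d1 = [2.0, 0.5, 4.0]
d2 = [3.0, 1.0, 0.25]
Mt = scale(M, d1, d2)

# Rescaled weights, following the definitions of P1-tilde and P2-tilde in the proof.
p1t = [p1[j] * d2[j] / d2[k1] for j in range(3)]
p2t = [d1[i] * p2[i] / d1[k2] for i in range(3)]

col_t = [Mt[i][k1] for i in range(3)]
row_t = Mt[k2]
recon = add(outer(col_t, p1t), outer(p2t, row_t))
max_err = max(abs(recon[i][j] - Mt[i][j]) for i in range(3) for j in range(3))
```

Here `max_err` is at the level of rounding error, and the entries of `p1t` and `p2t` at the selected indices remain equal to one.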

2.1 Unicity of GS decompositions

As opposed to separable NMF, GS-NMF does not necessarily admit a unique solution (up to scalings and permutations of the rank-one factors). In other words, for a minimal $(r_{1},r_{2})$ -separable matrix $M$ , the way of picking the rows and columns of $M$ is not necessarily unique: it may also be $(r_{3},r_{4})$ -separable with $r_{3}+r_{4}=r_{1}+r_{2}$ where $r_{1}\neq r_{3}$ and $r_{2}\neq r_{4}$ , or it can be $(r_{1},r_{2})$ -separable with a different selection of rows and columns; this is the case for example for the matrix $M$ in (13).

The simplest cases are for rank-one and rank-two matrices.

Property 7 (Rank-one matrices).

Any nonnegative rank-one matrix is (1,0)- and (0,1)-separable.

Proof.

This follows directly from the fact that all rows (resp. columns) of a rank-one matrix are multiples of one another. ∎

Property 8 (Rank-two matrices).

Any nonnegative rank-two matrix is (2,0)- and (0,2)-separable.

Proof.

This follows from the fact that any nonnegative rank-two matrix is 2-separable [45]. The reason is that a two-dimensional cone is always spanned by its two extreme rays. ∎

Examples can be constructed for any values of $(r_{1},r_{2})$ .

Property 9 (Construction of non-unique minimal $(r_{1},r_{2})$ -separable matrices).

For any $(r_{1},r_{2})$ , we can construct minimal $(r_{1},r_{2})$ -separable matrices such that they are also minimal $(r_{3},r_{4})$ -separable with $r_{1}+r_{2}=r_{3}+r_{4}$ , $r_{3}\neq r_{1}$ and $r_{4}\neq r_{2}$ .

Proof.

Let $r_{1}>r_{3}$ , $r_{2}<r_{4}$ and $r_{1}+r_{2}=r_{3}+r_{4}$ . Let also $M_{11}\in\mathbb{R}^{(m-r_{4})\times r_{3}}_{+}$ , $M_{22}\in\mathbb{R}^{(r_{4}-r_{2})\times(r_{1}-r_{3})}_{+}$ and $M_{33}\in\mathbb{R}^{r_{2}\times(n-r_{1})}_{+}$ be any nonnegative matrices. Let us construct $M$ as follows:

[TABLE]

where

[TABLE]

for any $X_{1}\in\mathbb{R}^{r_{3}\times(n-r_{1})}_{+}$ , $Y_{1}\in\mathbb{R}^{(m-r_{4})\times r_{2}}_{+}$ , $X_{2}\in\mathbb{R}^{(r_{1}-r_{3})\times{(n-r_{1})}}_{+}$ , $Y_{2}\in\mathbb{R}^{(r_{4}-r_{2})\times r_{2}}_{+}$ .

We have that $M$ is $(r_{1},r_{2})$ -separable where $\mathcal{K}_{1}$ contains the first $r_{1}$ columns of $M$ and $\mathcal{K}_{2}$ contains the last $r_{2}$ rows of $M$ since

[TABLE]

Similarly, we have that $M$ is $(r_{3},r_{4})$ -separable where $\mathcal{K}_{1}$ contains the first $r_{3}$ columns of $M$ and $\mathcal{K}_{2}$ contains the last $r_{4}$ rows of $M$ since $(0_{m-r_{4},r_{1}-r_{3}}\;M_{13})$ is equal to

[TABLE]

∎

The simplest example is for a 3-by-3 matrix that is $(2,1)$ - and $(1,2)$ -separable, and also trivially $(3,0)$ - and $(0,3)$ -separable:

[TABLE]

We simply took $M_{11}=M_{22}=M_{33}=X_{1}=X_{2}=Y_{1}=Y_{2}=1$ in Property 9.

However, it is possible to guarantee uniqueness of GS decompositions. A possible way is to have a single pattern of zeros which is large enough.

Property 10 (Condition for uniqueness).

Let $M$ be minimal $(r_{1},r_{2})$ -separable and let $M$ not be $(r_{1}+r_{2},0)$ -separable nor $(0,r_{1}+r_{2})$ -separable. If $M$ does not contain a pattern of zeros of size $r_{3}r_{4}$ with $r_{1}+r_{2}=r_{3}+r_{4}$ , except for $M(\mathcal{K}_{2},\mathcal{K}_{1})=0$ , then $M$ admits a unique GS decomposition of size $(r_{1},r_{2})$ .

Proof.

This follows directly from Property 1. ∎

Identifiability is a key aspect of NMF models; see the recent survey [14] on this topic and the references therein. Let us discuss this aspect and how GS-NMF makes it possible to resolve this issue (given that the input matrix is in fact a GS matrix). Given a nonnegative matrix $X=WH$ for $W\geq 0$ and $H\geq 0$ , the conditions for such a factorization to be unique, up to permutation and scaling of the rank-one factors $W(:,k)H(k,:)$ ( $1\leq k\leq r$ ), are rather strong and in general not met in practice. In a few words, these conditions require $W$ and $H$ to be sufficiently sparse. Therefore, in practice, it is crucial to consider regularized NMF models, for example by adding sparsity constraints on the factors $W$ and/or $H$ [25]. A key NMF model that leads to identifiability of NMF under very mild conditions is minimum-volume NMF (MV-NMF), which requires the convex hull of the columns of $W$ to have the smallest possible volume while the columns of $H$ are normalized to have unit $\ell_{1}$ norm [35]. Under this model, NMF is identifiable given that the matrix $H$ is sufficiently sparse; this condition is referred to as the sufficiently scattered condition [26, 31, 13, 14]. (In fact, the sufficiently scattered condition, which is sufficient to guarantee that MV-NMF is identifiable, is conjectured to also be necessary [14].) We will compare MV-NMF with GS-NMF in the numerical experiments (Section 5) and observe that GS-NMF is able to recover the true factors $W$ and $H$ that generated the GS matrix while standard NMF and MV-NMF fail to do so. This means that GS-NMF really brings a new class of identifiable NMF solutions. The reason is that GS matrices satisfy different conditions than the sufficiently scattered condition. This allows proper algorithms, like the ones we will propose in the next sections, to take advantage of these properties, hence making them able to recover the true $W$ and $H$ . We believe that this is a key asset of GS-NMF. 
For example, in audio source separation, it is reasonable to assume that the input matrix is a GS matrix (see the Introduction), while it might not be separable nor satisfy the sufficiently scattered condition.

3 Convex Optimization Model and Fast Gradient Method

In real data sets, due to the presence of noise (and model misfit), the model (11) should be modified to

[TABLE]

where $\epsilon$ denotes the noise level. The norm of the residual $\|M-MX-YM\|$ can be chosen according to the noise statistic. In this paper, we will consider the squared Frobenius norm, that is, $\|M-MX-YM\|_{F}^{2}=\sum_{i,j}(M-MX-YM)_{i,j}^{2}$ ; see for example [9] for a discussion on the choice of the objective function.

3.1 Convex optimization model

As it is challenging to solve (16), it can be relaxed to a convex optimization model as follows:

[TABLE]

where $\|X\|_{1,q}:=\sum^{n}_{i=1}\|X(i,:)\|_{q}$ and $\|Y^{T}\|_{1,q}:=\sum^{m}_{i=1}\|Y(:,i)\|_{q}$ . The quantities $\|X\|_{1,q}$ and $\|Y^{T}\|_{1,q}$ are the $\ell_{1}$ norms of the vectors containing the $\ell_{q}$ norms of the rows of $X$ and the columns of $Y$ , respectively. The model aims to generate a matrix $X$ with only a few non-zero rows and a matrix $Y$ with only a few non-zero columns. This model is a generalization of separable NMF convex relaxations: $q=2$ was proposed in [10], while $q=+\infty$ was proposed in [11]. In fact, (17) coincides with the models from [11, 10] by taking $Y=0$ .

The rationale behind this model is that the $\ell_{1}$ norm is the largest convex function smaller than the $\ell_{0}$ norm on the $\ell_{\infty}$ ball; see for example [41]. In other terms, $\|X\|_{1,q}\leq\|X\|_{row,0}$ as long as $\|X(i,:)\|_{q}\leq 1$ for all $i$ .

Considering $q=+\infty$ , $\|X(i,:)\|_{q}\leq 1$ holds for example for $X\leq 1$ . This can be assumed without loss of generality given that the input matrix is properly scaled.
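As a small illustration of these quantities, the sketch below computes $\|X\|_{1,q}$ and the number of non-zero rows $\|X\|_{row,0}$ for a toy matrix with entries in $[0,1]$ . The function names and the example matrix are assumptions made for the illustration only.

```python
def row_norm(row, q):
    """l_q norm of a single row; q may be float('inf')."""
    if q == float("inf"):
        return max(abs(v) for v in row)
    return sum(abs(v) ** q for v in row) ** (1.0 / q)

def mixed_norm(X, q):
    """||X||_{1,q}: sum over the rows of X of their l_q norms."""
    return sum(row_norm(r, q) for r in X)

def row_zero_norm(X):
    """||X||_{row,0}: number of rows of X with at least one non-zero entry."""
    return sum(1 for r in X if any(v != 0 for v in r))

# Toy matrix with 0 <= X <= 1 and one zero row.
X = [[1.0, 0.5, 0.0],
     [0.0, 0.0, 0.0],
     [0.2, 0.0, 0.3]]

inf = float("inf")
norm_1_inf = mixed_norm(X, inf)   # 1.0 + 0.0 + 0.3
rows_used = row_zero_norm(X)      # rows 0 and 2 are non-zero
```

Since every row satisfies $\|X(i,:)\|_{\infty}\leq 1$ , the value `norm_1_inf` cannot exceed `rows_used`, which is the inequality stated above for $q=+\infty$ .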

Definition 3.

The matrix $M$ is scaled if $||M(:,j)||_{1}=k_{1}$ for all $j$ and $||M(i,:)||_{1}=k_{2}$ for all $i$ , for some $nk_{1}=mk_{2}>0$ .

Given a nonnegative matrix $M$ , it is in most cases possible to scale it, that is, find diagonal matrices $D_{r}$ and $D_{c}$ such that $M_{s}=D_{r}MD_{c}$ is scaled. It requires that the matrix $M$ has sufficiently many non-zero elements. When the matrix is scalable, the algorithm that alternately scales the columns and rows of $M$ will converge to a scaled matrix. We refer the reader to [27, 40] for more details on this topic.
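The alternating scaling procedure can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: it assumes an entrywise positive input (so that no column or row sum vanishes), uses the targets $k_{1}=1$ and $k_{2}=n/m$ (one admissible choice satisfying $nk_{1}=mk_{2}$ ), and runs a fixed number of sweeps instead of a convergence test. The name `scale_matrix` is hypothetical.

```python
def scale_matrix(M, iters=200):
    """Alternately rescale columns and rows so that every column sums to k1
    and every row sums to k2, with n*k1 == m*k2.
    Returns the scaled matrix and the diagonals of D_r and D_c."""
    m, n = len(M), len(M[0])
    k1, k2 = 1.0, n / m
    dr, dc = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Column step: make each column of diag(dr) * M * diag(dc) sum to k1.
        for j in range(n):
            s = dc[j] * sum(dr[i] * M[i][j] for i in range(m))
            dc[j] *= k1 / s
        # Row step: make each row sum to k2.
        for i in range(m):
            s = dr[i] * sum(M[i][j] * dc[j] for j in range(n))
            dr[i] *= k2 / s
    Ms = [[dr[i] * M[i][j] * dc[j] for j in range(n)] for i in range(m)]
    return Ms, dr, dc

M = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
Ms, dr, dc = scale_matrix(M)
col_sums = [sum(Ms[i][j] for i in range(2)) for j in range(3)]
row_sums = [sum(r) for r in Ms]
```

For this positive example, the column sums approach 1 and the row sums approach 3/2 after a few sweeps. For sparse matrices the iteration may fail to converge, which is the scalability caveat discussed above.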

We have the following property.

Property 11.

Let $M$ be a scaled $(r_{1},r_{2})$ -separable matrix. Then $M$ can be decomposed as in (9) with

[TABLE]

Proof.

Using Property 2 we have, after proper permutations of the columns and rows of $M$ , that

[TABLE]

Since $M$ is scaled, we have $e^{T}M(:,j)=e^{T}W_{1}(:,j)=k_{1}$ for $1\leq j\leq r_{1}$ , where $e$ is the vector of all ones of appropriate dimension. For $j=r_{1}+1,\dots,n$ , we have

[TABLE]

since all matrices involved are nonnegative. This implies that $H_{1}\leq 1$ . In fact, this implies the stronger condition $||H_{1}(:,j)||_{1}\leq 1$ for all $j$ . By symmetry, the same result holds for $W_{2}$ , that is, $W_{2}\leq 1$ and $||W_{2}(i,:)||_{1}\leq 1$ for all $i$ . Therefore, up to permutations, using the same derivations as in Property 3, we have

[TABLE]

where $H_{1}\leq 1$ and $W_{2}\leq 1$ . ∎

In this paper, we focus on another convex model to tackle GS-NMF. For a scaled GS matrix, it can be written as follows:

[TABLE]

This model is the generalization of the model from [16] for separable matrices, which is an improvement of the model from [42]. The rationale behind this model is the following. Since $X$ is nonnegative, minimizing its trace is equivalent to minimizing the $\ell_{1}$ norm of its diagonal entries, that is, $\operatorname{trace}(X)=||\operatorname{diag}(X)||_{1}$ . Hence (18) promotes solutions whose diagonal is sparse. Then, the constraints $X(i,j)\leq X(i,i)$ for all $i,j$ impose that the largest entry in each row is the corresponding diagonal entry. Hence, if a diagonal entry is equal to zero, the entire row is zero. This makes this model generate solutions that tend to be row sparse. Note that for any feasible solution $X$ of (18), we have $\operatorname{trace}(X)\leq\|X\|_{row,0}$ . In fact, since $0\leq X(i,j)\leq X(i,i)$ for all $i,j$ , $\|X\|_{1,\infty}=\operatorname{trace}(X)$ for any feasible $X$ . Moreover, since $X\leq 1$ , $\|X\|_{1,\infty}\leq\|X\|_{row,0}$ . By symmetry, we also have $\operatorname{trace}(Y)\leq\|Y\|_{col,0}$ .

Unfortunately, as opposed to the model for separable matrices, we were not able to show that (18) is provably able to recover the correct set of column and row indices, even in the presence of low-noise levels. This is an important direction of future research. However, in the numerical experiments in Section 5, this model performs this task perfectly in all tested scenarios (see Figures 1 and 4).

The model (18) can easily be generalized for non-scaled $M$ ; see Section 3.2. Compared to (17), it has an important advantage: It is a smooth optimization problem, and the projection onto the feasible set can be performed efficiently, in $\mathcal{O}(n^{2}\log n+m^{2}\log m)$ operations [22]. Therefore, we can easily design a first-order optimization method with strong convergence guarantees; see Section 3.2.

3.2 Fast Gradient Method for GS-NMF

Let us generalize the model (18) to non-scaled matrices, as done for separable matrices in [21]. Using essentially a similar argument as in the proof of Property 11, we have for a GS matrix $M$ that for all $j$

[TABLE]

since $M=M(:,\mathcal{K}_{1})P_{1}+P_{2}M(\mathcal{K}_{2},:)$ . Taking the $\ell_{1}$ norm on both sides, we have

[TABLE]

A similar observation can be made for $P_{2}$ , which leads to the generalization of (18) for non-scaled matrices:

[TABLE]

where the sets $\Omega_{1}$ and $\Omega_{2}$ are defined as

[TABLE]

where the vector ${w}\in\mathbb{R}^{n}_{+}$ contains the $\ell_{1}$ norms of the columns of $M$ , that is, $w_{j}=\|M(:,j)\|_{1}$ for all $j=1,\dots,n$ , and the vector $\hat{w}\in\mathbb{R}^{m}_{+}$ contains the $\ell_{1}$ norms of the rows of $M$ , that is, $\hat{w}_{l}=\|M(l,:)\|_{1}$ for all $l=1,\dots,m$ .

To solve the smooth convex problem (19), interior-point methods can be used, for example using SDPT3 [46]. However, using such a second-order method to solve (19), which has $n^{2}+m^{2}$ variables and as many constraints, would be numerically expensive. Moreover, in our case, high accuracy solutions are not crucial: the main goal of solving (19) is to identify the important columns and rows of $M$ , which correspond to the largest diagonal entries of $X$ and $Y$ . Therefore, we use Nesterov’s optimal first-order method [38, 39], namely, a fast gradient method, similarly as done in [22] for separable matrices. Here “fast” refers to the fact that it attains the best possible convergence rate of $\mathcal{O}(1/k^{2})$ in the first-order regime. To do so, we consider the penalized version of (19):

[TABLE]

with $F(X,Y)=\frac{1}{2}\|M-MX-YM\|^{2}_{F}+\lambda\big{(}\operatorname{trace}(X)+\operatorname{trace}(Y)\big{)}$ , where $\lambda>0$ is a penalty parameter which balances the importance between the approximation error $\|M-MX-YM\|^{2}_{F}$ and the sum of the traces of $X$ and $Y$ .
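Any first-order method for this problem needs $F$ and its gradient. From the definition of $F$ , with the residual $R=M-MX-YM$ , one obtains $\nabla_{X}F=-M^{T}R+\lambda I_{n}$ and $\nabla_{Y}F=-RM^{T}+\lambda I_{m}$ . The sketch below evaluates these expressions with plain lists; the function names are hypothetical and the dense triple loops are for illustration only, not an efficient implementation.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(c) for c in zip(*A)]

def residual(M, X, Y):
    """R = M - M X - Y M."""
    MX, YM = matmul(M, X), matmul(Y, M)
    return [[M[i][j] - MX[i][j] - YM[i][j] for j in range(len(M[0]))]
            for i in range(len(M))]

def objective(M, X, Y, lam):
    """F(X, Y) = 0.5 * ||R||_F^2 + lam * (trace(X) + trace(Y))."""
    R = residual(M, X, Y)
    fit = 0.5 * sum(v * v for row in R for v in row)
    traces = sum(X[i][i] for i in range(len(X))) + sum(Y[i][i] for i in range(len(Y)))
    return fit + lam * traces

def gradients(M, X, Y, lam):
    """Return (grad_X, grad_Y) = (-M^T R + lam*I, -R M^T + lam*I)."""
    R = residual(M, X, Y)
    Mt = transpose(M)
    GX = matmul(Mt, R)   # n-by-n
    GY = matmul(R, Mt)   # m-by-m
    gX = [[-GX[i][j] + (lam if i == j else 0.0) for j in range(len(GX))]
          for i in range(len(GX))]
    gY = [[-GY[i][j] + (lam if i == j else 0.0) for j in range(len(GY))]
          for i in range(len(GY))]
    return gX, gY
```

Because $F$ is quadratic, a central finite difference of `objective` in any single entry of $X$ or $Y$ should match the corresponding entry returned by `gradients` up to rounding, which gives a quick sanity check.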

To initialize $X$ and $Y$ and set the value of $\lambda$ , we adopt the following strategy described in Algorithm 1, similarly as in [22]:

•

Extract a subset $\mathcal{K}_{1}$ of columns and a subset $\mathcal{K}_{2}$ of rows of $M$ such that $|\mathcal{K}_{1}|+|\mathcal{K}_{2}|=r$ using the heuristic algorithm referred to as GSPA; see Section 4.

•

Compute the corresponding optimal weights $(P_{1}^{*},P_{2}^{*})$ which is the solution to

[TABLE]

We used the coordinate descent implemented in [19].

•

Define $X_{0}(\mathcal{K}_{1},:)=P_{1}^{*}$ and $Y_{0}(:,\mathcal{K}_{2})=P_{2}^{*}$ , while $X_{0}(i,:)=0$ for $i\notin\mathcal{K}_{1}$ and $Y_{0}(:,j)=0$ for $j\notin\mathcal{K}_{2}$ .

•

Set $\lambda=\tilde{\lambda}\frac{\|M-MX_{0}-Y_{0}M\|}{2r}$ , where $r=r_{1}+r_{2}$ , for some $\tilde{\lambda}$ . Typically, $\tilde{\lambda}\in[10^{-3},10]$ works well.

To solve model (20), we employ Algorithm 2 which is an optimal first-order method to minimize $F(X,Y)$ over the sets $\Omega_{1}$ and $\Omega_{2}$ . To compute the Euclidean projection of $X$ on the set $\Omega_{1}$ and of $Y$ on the set $\Omega_{2}$ , we use the method proposed in [22], which only requires $\mathcal{O}(n^{2}\log n)$ and $\mathcal{O}(m^{2}\log m)$ operations for the projection of $X$ and $Y$ , respectively.

The main computational cost of Algorithm 2 resides in lines 2, 7 and 9. For line 2, the maximum singular value of $M$ can be well approximated by the power method which needs $\mathcal{O}(mn)$ operations. In line 5, the computation of the different matrix products requires $\mathcal{O}(mn^{2}+m^{2}n)$ operations. For line 7, the projections of $X$ and $Y$ require $\mathcal{O}(n^{2}\log n+m^{2}\log m)$ operations [22]. Overall, Algorithm 2 requires $\mathcal{O}(mn^{2}+m^{2}n)$ operations, assuming $m\geq\log n$ and $n\geq\log m$ .
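The power method mentioned above can be sketched as follows. This is a generic textbook version, not the code used in the paper: each iteration applies $M$ and $M^{T}$ once, hence costs $\mathcal{O}(mn)$ operations. The fixed iteration count and the all-ones starting vector are assumptions made for the illustration; for a nonnegative $M$ , a positive starting vector is a natural choice.

```python
import math

def largest_singular_value(M, iters=100):
    """Approximate the largest singular value of M by power iteration on M^T M."""
    m, n = len(M), len(M[0])
    v = [1.0 / math.sqrt(n)] * n           # unit-norm starting vector
    sigma = 0.0
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(m)]   # u = M v
        sigma = math.sqrt(sum(x * x for x in u))                        # ||M v|| with ||v|| = 1
        w = [sum(M[i][j] * u[i] for i in range(m)) for j in range(n)]   # w = M^T u
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:                   # M v = 0: nothing more to iterate on
            break
        v = [x / norm_w for x in w]
    return sigma

sigma_max = largest_singular_value([[3.0, 0.0],
                                    [0.0, 4.0]])
```

For the diagonal example, `sigma_max` converges to 4, the largest diagonal entry.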

We will refer to Algorithm 2 as GS-FGM. Note that the numbers $r_{1}$ and $r_{2}$ are given as input of Algorithm 2. However, they can also be detected automatically by identifying the entries on the diagonals of $X$ and $Y$ above a certain threshold. For simplicity, we will use the same two post-processing procedures as in [22]:

•

For synthetic data sets, we simply pick the $r_{1}$ largest entries of the diagonal of $X$ and the $r_{2}$ largest entries of the diagonal of $Y$ .

•

For real data sets, it is also important to consider off-diagonal entries of $X$ and $Y$ . The reason is that the input matrix can be far from being a GS matrix. For example, an outlying column will in general lead to a large diagonal entry in $X$ (since an outlier is in general not well approximated with other data points) while the other entries on the same row will be close to zero (since an outlier is in general useless to reconstruct other data points). This means that if a row of $X$ has many large entries, it is likely to be more important than a row with only a large diagonal entry. For this reason, we sort the columns of $M$ by applying SPA on $X^{T}$ as done in [22], and similarly for $Y$ to sort the rows of $M$ . It remains to decide how many column and row indices to pick in each of these ordered sets. To do so, we sequentially select a column or a row of $M$ as follows: at each step, we select the column/row of $M$ such that the residual after projection onto its orthogonal complement is the smallest, and at the next step, we replace $M$ by the corresponding residual; this shares some similarity with the algorithm presented in the next section.

4 Heuristic Algorithm for GS-NMF

Algorithm 2 is computationally expensive, and does not scale linearly with the dimension of the input matrix. For large-scale problems, it would not be applicable. When running on a standard computer, $m$ and $n$ should be limited to values below a thousand. A possible way to overcome this issue is to preselect, a priori, a subset of columns and rows of $M$ , reducing the number of variables; see Section 5.2 for a discussion.

In this section, we derive a fast heuristic algorithm for GS-NMF. It is inspired by one of the most widely used separable NMF algorithms, namely the successive projection algorithm (SPA). SPA is essentially equivalent to QR with column pivoting; it was introduced in [1] in the context of spectral unmixing but has been rediscovered many times; see the discussions in [33, 18]. Moreover, SPA is robust in the presence of noise [23]. SPA assumes that the input matrix has the form $M=M(:,\mathcal{K})[I_{r},H^{\prime}]\Pi$ where $\Pi$ is a permutation matrix and $H^{\prime}\geq 0$ and $||H^{\prime}(:,j)||_{1}\leq 1$ for all $j$ . This means that the columns of $M$ are in the convex hull of the columns of $M(:,\mathcal{K})$ ; in other words, the columns of $M(:,\mathcal{K})$ are the vertices of the convex hull of the columns of $M$ . We can identify a vertex of this convex hull using the $\ell_{2}$ norm as it must be maximized at a vertex. This is the main idea behind SPA which sequentially identifies the columns in $\mathcal{K}$ as follows: at each step, it first extracts the column of $M$ that has the largest $\ell_{2}$ norm and then projects all columns of $M$ onto the orthogonal complement of the extracted column. Under the assumption that $M(:,\mathcal{K})$ is full column rank, SPA recovers the set $\mathcal{K}$ .
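The two steps of SPA described above (pick the column of largest $\ell_{2}$ norm, then project onto its orthogonal complement) can be sketched as follows. This is a minimal illustration with plain lists; the function name `spa`, the tolerance used to stop early, and the toy matrix are assumptions made for the example.

```python
def spa(M, r, tol=1e-12):
    """Successive projection algorithm: return r column indices of M (0-based)."""
    m, n = len(M), len(M[0])
    R = [row[:] for row in M]               # residual, updated in place
    K = []
    for _ in range(r):
        sq_norms = [sum(R[i][j] ** 2 for i in range(m)) for j in range(n)]
        j_star = max(range(n), key=sq_norms.__getitem__)
        if sq_norms[j_star] <= tol:         # residual is numerically zero
            break
        K.append(j_star)
        u = [R[i][j_star] for i in range(m)]
        uu = sq_norms[j_star]
        # Project every column onto the orthogonal complement of u.
        for j in range(n):
            c = sum(u[i] * R[i][j] for i in range(m)) / uu
            for i in range(m):
                R[i][j] -= c * u[i]
    return K

# Columns 1, 3, 4 are the vertices (4,0,0), (0,3,0), (0,0,2);
# column 0 is the midpoint of the first two, column 2 the average of all three.
M = [[2.0, 4.0, 4.0 / 3.0, 0.0, 0.0],
     [1.5, 0.0, 1.0,       3.0, 0.0],
     [0.0, 0.0, 2.0 / 3.0, 0.0, 2.0]]
K = spa(M, 3)
```

On this separable example, SPA extracts the three vertex columns, in decreasing order of their norms.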

Algorithm 3 generalizes SPA in a straightforward manner; we refer to it as generalized SPA (GSPA). At each iteration, it identifies a column or a row of $M$ that will be used as a basis in a GS decomposition. Each iteration is made of two steps: First, it computes the norms of the columns of $M$ multiplied by $n$ and the norms of the rows of $M$ multiplied by $m$ , and selects the column/row corresponding to the largest value. Second, it projects the columns/rows of $M$ onto the orthogonal complement of the selected column/row.
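The two steps of a GSPA iteration, as described in the paragraph above, can be sketched as follows. This is only a reading of the prose, not a transcription of Algorithm 3: the use of unsquared $\ell_{2}$ norms, the tie-breaking rule (columns are preferred over rows), the early-stopping tolerance and the function name `gspa` are all assumptions.

```python
import math

def gspa(M, r, tol=1e-12):
    """Sketch of generalized SPA: return (column indices, row indices), 0-based."""
    m, n = len(M), len(M[0])
    R = [row[:] for row in M]
    cols, rows = [], []
    for _ in range(r):
        # Step 1: weighted norms of the columns (times n) and of the rows (times m).
        col_scores = [n * math.sqrt(sum(R[i][j] ** 2 for i in range(m))) for j in range(n)]
        row_scores = [m * math.sqrt(sum(v * v for v in R[i])) for i in range(m)]
        j = max(range(n), key=col_scores.__getitem__)
        i = max(range(m), key=row_scores.__getitem__)
        if max(col_scores[j], row_scores[i]) <= tol:
            break
        # Step 2: project onto the orthogonal complement of the selected column/row.
        if col_scores[j] >= row_scores[i]:      # assumed tie-break: prefer columns
            cols.append(j)
            u = [R[k][j] for k in range(m)]
            uu = sum(x * x for x in u)
            for q in range(n):
                c = sum(u[k] * R[k][q] for k in range(m)) / uu
                for k in range(m):
                    R[k][q] -= c * u[k]
        else:
            rows.append(i)
            v = R[i][:]
            vv = sum(x * x for x in v)
            for p in range(m):
                c = sum(v[k] * R[p][k] for k in range(n)) / vv
                for k in range(n):
                    R[p][k] -= c * v[k]
    return cols, rows

# Toy (1,1)-separable matrix: column 0 plus row 2 reproduce it exactly.
M = [[5.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 4.0]]
selected_cols, selected_rows = gspa(M, 2)
```

On this toy matrix the sketch first selects column 0 and then row 2, after which the residual is zero.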

One can check that the computational cost of GSPA is $\mathcal{O}(mnr)$ operations; the main operations being matrix-vector products. As for SPA, GSPA should be applied to a scaled GS matrix. Unfortunately, as opposed to SPA, there is no guarantee that a column (resp. row) with maximum $\ell_{2}$ norm will belong to the set $\mathcal{K}_{1}$ (resp. $\mathcal{K}_{2}$ ); see Example 1 below where we construct a particular GS matrix for which GSPA fails. Hence GSPA is a heuristic for GS-NMF. However, a topic for further research would be to show that GSPA works under suitable additional conditions, that is, for a subset of GS matrices. In fact, as we will see in the numerical experiments, GSPA works remarkably well for some randomly generated GS matrices.

Example 1.

Let us consider the following (2,2)-separable matrix

[TABLE]

where $H_{2}=W_{1}^{T}$ and $W_{2}=H_{1}^{T}$ . For $\epsilon=0.001$ , we have

[TABLE]

Using SPA, one can check that $M$ is not (4,0)- nor (0,4)-separable. Since there is no pattern of zeros of dimension (1,3) or (3,1), it is not (1,3)- nor (3,1)-separable (see Property 1). Therefore, $\mathcal{K}_{1}=\{1,2\}$ and $\mathcal{K}_{2}=\{5,6\}$ is the only possible GS decomposition with $|\mathcal{K}_{1}|+|\mathcal{K}_{2}|=4$ . The scaled version of $M$ is

[TABLE]

The column with largest $\ell_{2}$ norm is the third which is not in $\mathcal{K}_{1}$ , and the row with the largest $\ell_{2}$ norm is the first which is not in $\mathcal{K}_{2}$ ; they both have the same norm. Therefore, GSPA fails: it returns $\mathcal{K}_{1}=\{1,2,3\}$ and $\mathcal{K}_{2}=\{5\}$ , or $\mathcal{K}_{1}=\{2\}$ and $\mathcal{K}_{2}=\{1,4,5\}$ (rows and columns of $M_{s}$ are the same up to permutations, because $H_{2}=W_{1}^{T}$ and $H_{1}=W_{2}^{T}$ ).

Note however that the matrix is almost (3,1)- and (1,3)-separable. In fact,

[TABLE]

Note also that the model (18) applied on $M_{s}$ identifies $X$ and $Y$ perfectly, with the form of (9).

5 Numerical Experiments

In this section, we conduct experiments on synthetic (Section 5.1), document (Section 5.2) and image data sets (Section 5.3) to test the performance of the proposed models. All experiments were run on an Intel(R) Core(TM) i5-5200 CPU @ 2.20 GHz with 8 GB of RAM using Matlab.

Since GS-NMF has not been considered before, we cannot compare GS-FGM and GSPA to existing GS-NMF algorithms. Instead, we consider a state-of-the-art separable NMF algorithm, namely, the successive projection algorithm (SPA); see the description in Section 4.

Separable NMF algorithms such as SPA can only identify a subset of the columns of the input matrix $M$ . Hence, we consider the following three possibilities:

1. SPA is applied on $M$ to identify $r_{1}$ important columns of $M$ , and then on $M^{T}$ to identify $r_{2}$ important rows of $M$ . We refer to this variant as SPA*. Note that this is another heuristic to tackle GS-NMF. It is rather different from GSPA, which only requires $r$ as an input and identifies automatically the number of columns and rows to extract; see Algorithm 3.

2. SPA is applied on $M$ to identify $r=r_{1}+r_{2}$ columns of $M$ . We refer to this variant as SPA-C.

3. SPA is applied on $M^{T}$ to identify $r=r_{1}+r_{2}$ rows of $M$ . We refer to this variant as SPA-R.

Although the last two approaches (namely, SPA-C and SPA-R) will not be able to tackle GS-NMF, it is interesting to include them in the comparison to see how much GS-NMF algorithms can reduce the approximation error compared to separable NMF algorithms.

Remark 1.

We have also considered other separable NMF algorithms combined with the above strategies; namely the successive nonnegative projection algorithm (SNPA) [17], XRAY [28] and FGNSR [22]. However, they provided results similar to SPA hence we do not show these results here for the simplicity of the presentation.

We have used a stopping criterion for GS-FGM based on the evolution of the iterates and the error: we stop GS-FGM when one of the following conditions holds:

[TABLE]

where $e(k)$ is the objective function at iteration $k$ , $Z^{(k)}=(X^{(k)},Y^{(k)})$ is the solution at iteration $k$ , and $0<\delta<1$ is a parameter. We will use $\delta=10^{-4}$ for synthetic data sets and $\delta=10^{-2}$ for the real data sets (documents and images).

We will also compare these algorithms to the following algorithms:

•

A state-of-the-art NMF algorithm, namely the accelerated hierarchical alternating least squares (A-HALS) algorithm [19]. We will refer to this algorithm as NMF.

•

A state-of-the-art minimum-volume NMF algorithm [15]. We use the improved implementation that uses a fast gradient method to solve the subproblems in $W$ and $H$ from [30]. We will refer to this algorithm as MV-NMF.

For both algorithms, we use the default parameters and perform 1000 iterations.

The code is available from https://sites.google.com/site/nicolasgillis/code.

5.1 Synthetic data sets

In this section, we compare the different algorithms on two types of synthetic data sets: fully randomly generated (Section 5.1.1), and the so-called middle-point experiment with adversarial noise (Section 5.1.2).

For GS-FGM, we identify the subsets $\mathcal{K}_{1}$ and $\mathcal{K}_{2}$ by using the $r_{1}$ largest diagonal entries of $X$ and $r_{2}$ largest diagonal entries of $Y$ , respectively. In all experiments, we run GS-FGM with the parameter $\tilde{\lambda}=0.25$ and maxiter = 1000.

Given the subsets $(\mathcal{K}_{1},\mathcal{K}_{2})$ computed by an algorithm, we will report the following three quality measures:

1. The accuracy, defined as

[TABLE]

where $\mathcal{K}^{*}_{1}$ and $\mathcal{K}^{*}_{2}$ are the true column and row indices used to generate $M$ . The accuracy reports the proportion of correctly identified row and column indices.

Note that the accuracy cannot be computed for NMF and MV-NMF, which do not identify columns and rows of the input matrix.

2. The relative approximation error, defined as

[TABLE]

Note that we compute $P_{1}$ and $P_{2}$ using the coordinate descent method from [19].

3. The distance to ground truth: given the solution $(W,H)$ of an algorithm, it is defined as

[TABLE]

where $\pi_{w}$ and $\pi_{h}$ are permutations, and $(W^{*},H^{*})$ is the ground truth that generated the noiseless input data $M=W^{*}H^{*}$ (see Definition 1). Note that for GS-NMF algorithms, $W=[M(:,\mathcal{K}_{1}),P_{2}^{*}]$ and $H=[P_{1}^{*};M(\mathcal{K}_{2},:)]$ where $P_{1}^{*}$ and $P_{2}^{*}$ are the solutions of (23).
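The accuracy measure is described above as the proportion of correctly identified row and column indices. A minimal sketch consistent with that description is given below; the exact normalization is an assumption inferred from the prose (the displayed formula is the authoritative definition), and the function name is hypothetical.

```python
def accuracy(K1, K2, K1_true, K2_true):
    """Fraction of true column and row indices that were identified.
    Assumed form: (|K1 & K1*| + |K2 & K2*|) / (|K1*| + |K2*|)."""
    hits = len(set(K1) & set(K1_true)) + len(set(K2) & set(K2_true))
    return hits / (len(set(K1_true)) + len(set(K2_true)))

# Example: one of the two columns is wrong, both rows are right.
acc = accuracy([1, 3], [5, 6], [1, 2], [5, 6])
```

In this example three of the four true indices are recovered, so `acc` equals 0.75.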

5.1.1 Fully randomly generated data

We generate noisy (20,20)-separable matrices $M\in\mathbb{R}^{100\times 100}$ as follows:

[TABLE]

where

•

The entries of the matrices $W_{1}\in\mathbb{R}^{80\times 20}$ and $H_{2}\in\mathbb{R}^{20\times 80}$ are generated uniformly at random in the interval [0,1] using the rand function of MATLAB. $H_{1}\in\mathbb{R}^{20\times 80}$ and $W_{2}\in\mathbb{R}^{80\times 20}$ are generated using sparse uniformly distributed random matrices with the density equal to 50% (sprand(m,n,0.5) in Matlab).

•

The diagonal matrices $D_{r}$ and $D_{c}$ are computed so that $M^{s}$ is scaled; we use the algorithm that alternately scales the columns and rows of the input matrix [27, 40].

•

The entries of the noise $N\in\mathbb{R}^{100\times 100}$ are generated at random following the normal distribution of mean 0 and standard deviation 1 using the randn function of MATLAB. The noise matrix $N$ is then normalized so that $||N||_{F}=\epsilon||M^{s}||_{F}$ , where $M^{s}$ is the noiseless scaled (20,20)-separable matrix, and $\epsilon$ is a parameter that relates to the noise level.

•

$\Pi_{r}$ and $\Pi_{c}$ are randomly generated permutation matrices.

We use 20 noise levels $\epsilon$ logarithmically spaced in $[10^{-3},1]$ (in Matlab, logspace(-3,0,20)). For each noise level, we generate 25 such matrices and report the average quality measures on Figures 1, 2 and 3.
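For readers who do not use Matlab, the noise levels and the noise normalization described above can be sketched in plain Python. The helper names (`logspace`, `frobenius`, `scaled_noise`), the random seed and the small toy matrix standing in for $M^{s}$ are assumptions made for the illustration.

```python
import math
import random

def logspace(a, b, num):
    """num values 10**x with x evenly spaced in [a, b], as Matlab's logspace(a, b, num)."""
    step = (b - a) / (num - 1)
    return [10.0 ** (a + k * step) for k in range(num)]

def frobenius(A):
    return math.sqrt(sum(v * v for row in A for v in row))

def scaled_noise(Ms, eps, rng):
    """Gaussian noise N of the same shape as Ms, normalized so that ||N||_F = eps * ||Ms||_F."""
    N = [[rng.gauss(0.0, 1.0) for _ in row] for row in Ms]
    factor = eps * frobenius(Ms) / frobenius(N)
    return [[factor * v for v in row] for row in N]

levels = logspace(-3, 0, 20)                 # 20 noise levels in [1e-3, 1]
rng = random.Random(0)
Ms = [[rng.random() for _ in range(5)] for _ in range(4)]   # toy stand-in for M^s
N = scaled_noise(Ms, levels[10], rng)
ratio = frobenius(N) / frobenius(Ms)
```

By construction, `ratio` equals the chosen noise level up to rounding.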

We observe the following:

•

As expected, SPA-C and SPA-R have an accuracy of at most 50%, and perform very badly to recover the ground truth. Moreover, they also perform much worse than GS-FGM in terms of approximation error. This validates the GS-NMF model in the sense that it is able to reduce the approximation error compared to separable NMF for the same factorization rank.

•

In terms of accuracy, GS-FGM performs the best, having an accuracy of 100% for all $\epsilon\leq 0.483$ . Surprisingly, even for low-noise levels, GSPA is not able to recover exactly all column and row indices (see the zoomed-in graph on Figure 1). SPA∗ performs better than SPA-C and SPA-R, but much worse than GS-FGM and GSPA.

•

In terms of relative error, NMF performs similarly to GS-FGM. This is not surprising since NMF factorizes the input matrix with no other constraints than nonnegativity. It is actually nice to observe that GS-FGM produces solutions with the same relative error as NMF although this model is much more constrained; the reason is that the input data satisfies our assumption.

•

In terms of recovering the ground truth, GS-FGM outperforms all other algorithms, followed by GSPA. NMF and MV-NMF are not able to recover the ground truth due to the non-uniqueness of the solution. This shows experimentally the advantage of using GS-NMF to have identifiability of the solution, given that the input matrix is close to being a GS matrix; see the discussion in Section 2.1.

In the next section, we construct more complicated synthetic data sets for which the behavior of the different algorithms is further highlighted.

5.1.2 Middle points and adversarial noise

In this section, we generate the noisy GS matrices exactly as in the previous section except that $m=78$ , $n=55$ , $r_{1}=10$ , $r_{2}=12$ , and

•

the $\binom{r_{1}}{2}=45$ columns of $H_{1}$ (resp. $\binom{r_{2}}{2}=66$ rows of $W_{2}$ ) contain all possible combinations of two non-zero entries equal to $0.5$ at different positions. Hence, the columns of $W_{1}H_{1}$ (resp. rows of $W_{2}H_{2}$ ) are all the middle points of the columns of $W_{1}$ (resp. rows of $H_{2}$ ).

•

No noise is added to the first $r_{1}$ columns and last $r_{2}$ rows of $M^{s}$ , that is, $N(:,1:r_{1})=0$ and $N(m-r_{2}+1:m,:)=0$ , while we set $N(1:m-r_{2},r_{1}+1:n)$ equal to

[TABLE]

where $\bar{w}$ and $\bar{h}$ are the average of the columns of $W_{1}$ and rows of $H_{2}$ , respectively, that is, $\bar{w}=\frac{1}{r_{1}}W_{1}e$ and $\bar{h}=\frac{1}{r_{2}}e^{T}H_{2}$ . Intuitively, the noise will move the data points towards the outside of the convex hull of the columns of $W_{1}$ and the rows of $H_{2}$ . The noise matrix $N$ is normalized so that $||N||_{F}=\epsilon||M^{s}||_{F}$ .

This example is inspired by the so-called middle point experiment from [23]. Intuitively, we are moving the data points towards the outside of the set spanned by $W_{1}$ and $H_{2}$ .
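The middle-point weight matrices described above can be generated as follows. This sketch only builds the weights (each column has exactly two entries equal to 0.5); the function name is hypothetical, and $W_{2}$ would be obtained as the transpose of the matrix built for $r_{2}$ .

```python
from itertools import combinations

def middle_point_weights(r):
    """r-by-C(r,2) matrix whose columns contain all pairs of two entries equal to 0.5."""
    pairs = list(combinations(range(r), 2))
    H = [[0.0] * len(pairs) for _ in range(r)]
    for c, (a, b) in enumerate(pairs):
        H[a][c] = 0.5
        H[b][c] = 0.5
    return H

H1 = middle_point_weights(10)                                  # 10 x 45
W2 = [list(col) for col in zip(*middle_point_weights(12))]     # 66 x 12
```

With $r_{1}=10$ and $r_{2}=12$ this gives 45 columns and 66 rows, which together with the selected columns and rows matches $n=55$ and $m=78$ .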

We use the same strategy for the choice of the noise levels, and report the average quality measures over 25 trials on Figures 4, 5 and 6.

We observe the following:

•

In terms of accuracy, the observations are similar to those for the fully random synthetic data sets. SPA-R and SPA-C are naturally not able to achieve a good accuracy. Note however that SPA-R performs better than SPA-C because there are more separable rows (12) than columns (10). Moreover, GS-FGM is the only algorithm able to recover the column and row indices perfectly for $\epsilon\leq 0.113$ . GSPA performs almost as well but cannot extract all indices (see the zoomed-in graph on Figure 4). SPA∗ performs in between.

•

In terms of approximation error, the behavior is rather interesting: GS-FGM outperforms NMF for low noise levels ( $\epsilon\leq 0.01$ ) while, for larger noise levels, NMF (and to a lesser extent MV-NMF) performs better. The reason for the worse performance of NMF at low noise levels is that the problem is more complicated and the NMF algorithm gets stuck in bad local minima. This is a rather interesting observation: using the GS prior, one can identify better solutions than standard NMF.

•

In terms of distance to the ground truth, we observe a similar behavior as for the fully random synthetic data sets except that NMF and MV-NMF perform even worse because of the more complicated structure of the data.

This second experiment shows the superiority of GS-NMF compared to NMF and separable NMF: GS-NMF makes it possible to identify the true underlying factors, leading to low approximation errors. Among GS-NMF algorithms (namely, GS-FGM, GSPA and SPA*), GS-FGM performs best, producing solutions with higher accuracy, lower approximation error and better identified factors. The second best is GSPA.

5.2 Document data sets

In this section, we compare the different algorithms on document data sets. We use the TDT30 data set [6], and the 14 data sets from [48]. Note that document data sets are sparse and hence not necessarily scalable; we therefore did not scale the input matrix.

For GS-FGM, we try 10 different values of $\tilde{\lambda}$ chosen in $[10^{-3},10]$ with 10 log-spaced values (in Matlab, logspace(-3,1,10)), and keep the solution with the highest approximation quality. The approximation quality is defined as one minus the relative approximation error (23); hence the higher the better. As opposed to the synthetic data sets, the numbers $r_{1}$ and $r_{2}$ are unknown. To evaluate $(r_{1},r_{2})$ when using GS-FGM, we use the strategy described in Section 3.2 for real data sets.
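The parameter sweep described above can be sketched as follows. This is a minimal illustration, not the authors' code: `best_over_grid`, `toy_solve` and `toy_quality` are hypothetical names, and the toy functions are placeholders standing in for a GS-FGM run and for the approximation quality (23). Only the grid itself, `np.logspace(-3, 1, 10)`, mirrors the Matlab call quoted in the text.

```python
import numpy as np

def best_over_grid(solve, quality, lambdas):
    """Run `solve` for each lambda and keep the solution with the highest quality."""
    best_q, best_lam, best_sol = -np.inf, None, None
    for lam in lambdas:
        sol = solve(lam)      # hypothetical solver call (e.g. one GS-FGM run)
        q = quality(sol)      # higher is better, as for the quality measure in the text
        if q > best_q:
            best_q, best_lam, best_sol = q, lam, sol
    return best_q, best_lam, best_sol

# Same grid as Matlab's logspace(-3,1,10): 10 log-spaced values in [1e-3, 10].
lambdas = np.logspace(-3, 1, 10)

# Toy stand-ins so that the sketch runs; they are NOT GS-FGM.
toy_solve = lambda lam: lam
toy_quality = lambda sol: -abs(np.log10(sol))  # peaks for the grid value closest to 1

best_q, best_lam, best_sol = best_over_grid(toy_solve, toy_quality, lambdas)
print(best_lam)
```

Keeping the solver and the quality measure as arguments makes the selection rule independent of the particular algorithm being tuned.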

Subsampling. For the document data sets, the size of the input data matrix can be very large (the number of words is typically of the order of $10^{4}$ ). It is impractical to apply GS-FGM to such data sets since GS-FGM runs in $\mathcal{O}(mn^{2}+nm^{2})$ operations. As done in [22], we preselect a subset of columns and rows of the input matrix. To do so, we adopt the hierarchical clustering from [20], running on average in $\mathcal{O}(mn\log_{2}C)$ , where $C$ is the number of clusters to generate. For the tr11 and tr23 data sets, since the number of documents is relatively small (414 for tr11, 204 for tr23), we keep all the documents and extract 500 words. For Newsgroups 20, which is a very large data set, we only consider the first 10 classes and refer to the corresponding data set as NG10. For the other data sets, we extract 500 documents and 500 words, and consider a submatrix $M_{s}\in\mathbb{R}^{500\times 500}$ . However, we take into account the importance of each selected column and row by identifying the number of data points attached to it (this is given by the hierarchical clustering). Specifically, we scale each selected column and row by the square root of the number of points belonging to its cluster.
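The weighting step at the end of the subsampling procedure can be illustrated with a short sketch. It assumes that the clustering has already produced the indices of the selected rows and columns together with the size of each cluster. The function name `weighted_submatrix` and its arguments are hypothetical, and applying both the row and the column scaling to the same submatrix is an assumption made here for illustration.

```python
import numpy as np

def weighted_submatrix(M, row_idx, col_idx, row_counts, col_counts):
    """Extract M(row_idx, col_idx) and scale each selected row and column
    by the square root of the size of the cluster it represents."""
    M_s = np.asarray(M, dtype=float)[np.ix_(row_idx, col_idx)]
    row_w = np.sqrt(np.asarray(row_counts, dtype=float))
    col_w = np.sqrt(np.asarray(col_counts, dtype=float))
    # Broadcasting: row weights along axis 0, column weights along axis 1.
    return M_s * row_w[:, None] * col_w[None, :]

# Small example: 2 selected rows and 3 selected columns of a 4-by-5 matrix.
M = np.ones((4, 5))
M_s = weighted_submatrix(M, [0, 2], [1, 3, 4], [4, 9], [1, 16, 25])
print(M_s)
```

With this weighting, a representative of a large cluster contributes more to the Frobenius-norm error than a representative of a small one.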

Finally, each algorithm will identify a subset of $r_{1}$ columns and $r_{2}$ rows of the subsampled matrix. From these subsets, we identify the corresponding columns and rows of the original matrix, and Table 1 reports the approximation quality (23) of the different algorithms. It also reports the approximation quality of the rank- $r$ truncated SVD, that is, $1-\frac{||M-M_{r}||_{F}}{||M||_{F}}$ where $M_{r}$ is the best rank- $r$ approximation of $M$ , to serve as a reference. The separable NMF variants that extract $r_{1}$ columns and $r_{2}$ rows are only run with the values of $(r_{1},r_{2})$ identified by GS-FGM.
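The truncated SVD reference value can be computed directly from the singular values, since by the Eckart-Young theorem $||M-M_{r}||_{F}$ equals the square root of the sum of the squared discarded singular values. The sketch below is an illustration only; the function name `svd_quality` is hypothetical.

```python
import numpy as np

def svd_quality(M, r):
    """Return 1 - ||M - M_r||_F / ||M||_F, with M_r the best rank-r approximation of M."""
    s = np.linalg.svd(M, compute_uv=False)   # singular values in decreasing order
    err = np.sqrt(np.sum(s[r:] ** 2))        # Frobenius norm of the discarded part
    return 1.0 - err / np.sqrt(np.sum(s ** 2))

# Cross-check against an explicit rank-r reconstruction on a random nonnegative matrix.
rng = np.random.default_rng(0)
M = rng.random((6, 5))
r = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_r = (U[:, :r] * s[:r]) @ Vt[:r]
explicit = 1.0 - np.linalg.norm(M - M_r) / np.linalg.norm(M)
print(np.isclose(svd_quality(M, r), explicit))
```

Working from the singular values alone avoids forming $M_{r}$ explicitly, which matters for large matrices.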

We observe the following:

•

GS-FGM and GSPA provide the same solutions in 6 out of the 15 cases. In 5 out of these 6 cases, SPA* provides the same solution.

•

As opposed to the synthetic data sets, SPA-C and SPA-R sometimes perform best, although never significantly better than GS-FGM.

•

GS-FGM performs best on average, having in all cases the highest or second highest relative approximation quality.

NMF and MV-NMF provide solutions with lower approximation error. This is expected since GS-NMF is much more constrained than these NMF variants while the data set is far from being a GS matrix. In fact, we observe that these data sets are not even close to being low rank; see the last column of Table 1 where the relative approximation quality of the truncated SVD is below 10% for many data sets. However, it makes sense to perform low-rank approximations to extract meaningful patterns in these documents. In particular, GS-NMF provides subsets of important words and documents; see Table 2 for an example. This illustrates the advantage of interpretability of GS-NMF compared to standard NMF approaches.

The last line of Table 1 reports the average computational time in seconds for the different algorithms. As expected, GS-FGM is slower but the computational time is reasonable for such matrices (below 2.5 seconds in all cases, with an average of 1.25 seconds). Note that NMF and MV-NMF are slower because they are applied directly to the full data sets.

5.3 Facial image data sets

In this section, the algorithms are applied to facial image data sets. In this context, GS-NMF will identify important subjects and important pixels that allow the original images to be reconstructed as well as possible. We use the following facial image data sets:

•

The CBCL data set is a public database for research usage provided by the MIT Center for Biological and Computational Learning. It consists of 2429 face images of size $19\times 19$ so that the input pixel-by-face matrix has dimension $361\times 2429$ . We set $r=49$ as in [29].

•

The Frey data set is collected by Brendan Frey. It contains 1965 images of Brendan’s face and the size of each image is $20\times 28$ so that the input pixel-by-face matrix has dimension $560\times 1965$ . We set $r=50$ .

•

The Yale data set contains 38 individuals, each of which has 64 frontal face images under different lighting conditions. The images have size $192\times 168$ , which is too large for our purpose (see the discussion in the previous section), hence all the images are downsampled to have size $48\times 42$ . We also select 10 face images from each individual randomly and obtain 380 images. Finally, the pixel-by-face matrix has dimension $2016\times 380$ . We set $r=n/10=38$ .

•

The ORL data set contains a set of faces taken between April 1992 and April 1994 at the Olivetti Research Laboratory in Cambridge, UK. There are ten different images of each of the 40 distinct subjects, each of size $112\times 92$ . We subsample each image to obtain images of size $23\times 19$ . The pixel-by-face matrix has dimension $437\times 400$ . We set $r=n/10=40$ .

Note that the factorization ranks were chosen rather arbitrarily; we refer the reader to [44] for a discussion on the choice of $r$ . We use the same strategy to tune $\tilde{\lambda}$ in GS-FGM as for document data sets. To give each facial image the same importance, we scale them so that their $\ell_{1}$ norm is equal to one. The relative approximation quality of the factorizations provided by the different algorithms is reported in Table 3.
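The $\ell_{1}$ scaling of the images can be sketched as below, assuming that each facial image is stored as one column of the pixel-by-face matrix, as in the data set descriptions. The function name `normalize_columns_l1` is hypothetical, and leaving all-zero columns untouched is a choice made here to avoid division by zero.

```python
import numpy as np

def normalize_columns_l1(M):
    """Scale each column of M so that its l1 norm equals one."""
    M = np.asarray(M, dtype=float)
    norms = np.abs(M).sum(axis=0)   # l1 norm of each column (one column per image)
    norms[norms == 0] = 1.0         # all-zero columns are left unchanged
    return M / norms                # broadcasting divides column j by norms[j]

# Two toy "images" with 2 pixels each.
M = np.array([[1.0, 2.0],
              [3.0, 2.0]])
M_scaled = normalize_columns_l1(M)
print(M_scaled.sum(axis=0))
```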

For these data sets, GS-FGM outperforms GSPA. However, SPA-R works very well, slightly better than GS-FGM on the CBCL and ORL data sets and worse on the Frey and Yale data sets. The reason is that extracting representative faces within a set of images is not always very appropriate because of the nonnegativity constraints. In some sense, the GS-NMF model is not ideal in this situation, but it is still able to provide meaningful results; for example, for the Yale data set, it provides significantly lower approximation error than all other algorithms. Here the non-uniqueness issue plays a role. For example, on the Frey data set, we see that using a $(0,49)$ -separable approximation (SPA-R) leads to an error very close to a $(35,14)$ -separable approximation (GS-FGM).

As for the document data sets, NMF and MV-NMF provide solutions with lower approximation error.

Figures 7, 9, 11 and 13 provide visual representations of the solutions generated by GS-FGM: for each data set, they display the positions of the selected pixels and the selected representative faces. It is interesting to observe the location of the selected pixels: they are either located on the edge (where pixels behave rather differently, not being part of the faces) or are well spread around the center of the face. The selected faces represent rather different faces from the data sets. For the CBCL and ORL data sets, the selected faces either come from different persons that look rather different, or from the same person in very different positions or with different illuminations (Figures 7 and 13). For the Frey data set, the selected faces represent different emotions (Figure 9). For the Yale data set, the selected faces represent different persons and illuminations (Figure 11).

Figures 8, 10, 12, and 14 display some sample images from the different data sets and their reconstruction using GS-FGM.

Table 4 reports the computational time for the different algorithms. As expected, GS-FGM is slower; in particular for the largest data set, namely the Yale data set ( $2016\times 380$ ), where GS-FGM requires 35 seconds.

5.4 Take-home messages from the numerical experiments

In terms of approximation error, GS-NMF provides in general results that are better than separable NMF algorithms. For synthetic data sets, where the input data is close to being a GS matrix, GS-NMF competes favourably with NMF and MV-NMF. In particular, it is able to recover the ground truth factors while standard NMF algorithms fail to do so. Moreover, for more complicated data sets (see Section 5.1.2), GS-NMF can even produce solutions with much lower approximation error than NMF whose solutions are stuck at bad local minima. For real data sets, GS-NMF produces solutions with higher approximation error than NMF and MV-NMF, because of the strong model assumptions. However, it has the advantage of producing highly interpretable solutions. The improved interpretability was exemplified on a document data set (see Table 2), and on facial images where GS-NMF identified important pixels and subjects in a set of facial images (see Figures 7, 9, 11 and 13), which is not possible with any other current NMF algorithm.

6 Conclusion

In this paper, we have generalized separable NMF: instead of only selecting columns of the input matrix to approximate it, we allow for columns and rows to be selected. We refer to this problem as generalized separable NMF (GS-NMF). We studied some interesting properties of matrices that can be decomposed using GS-NMF; they are referred to as GS matrices. In particular, we showed that GS-NMF can represent matrices much more compactly than separable NMF. Then, we proposed a convex optimization model to tackle GS-NMF, and developed a fast gradient method to solve the model. We also proposed a heuristic algorithm inspired by the successive projection algorithm from the separable NMF literature. We compared the algorithms on synthetic, document and image data sets and showed that they are able, in most cases, to generate decompositions with smaller approximation error than separable NMF algorithms.

Compared to standard NMF algorithms, GS-NMF provides decompositions with higher approximation errors (because of the additional constraints in the decomposition) but provides meaningful and easily interpretable factors. For example, for facial images, GS-NMF identifies important pixels and subjects in a data set. Moreover, for synthetic data sets, GS-NMF was able to recover the ground truth factors, sometimes leading to much lower approximation error than NMF algorithms.

Further work includes deepening our understanding of GS matrices. This would hopefully allow us, for example, to design more efficient algorithms that provably recover optimal decompositions under suitable conditions (e.g., uniqueness) and in the presence of noise, as done for separable NMF algorithms.

Bibliography

[1] M. C. U. Araújo, T. C. B. Saldanha, R. K. H. Galvao, T. Yoneyama, H. C. Chame, and V. Visani. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemometrics and Intelligent Laboratory Systems, 57(2):65–73, 2001.
[2] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning, pages 280–288, 2013.
[3] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization–provably. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 145–162. ACM, 2012.
[4] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization–provably. SIAM Journal on Computing, 45(4):1582–1611, 2016.
[5] S. Arora, R. Ge, and A. Moitra. Learning topic models–going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1–10. IEEE, 2012.
[6] D. Cai, Q. Mei, J. Han, and C. Zhai. Modeling hidden topics on document manifold. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 911–920. ACM, 2008.
[7] T.-H. Chan, W.-K. Ma, A. Ambikapathi, and C.-Y. Chi. A simplex volume maximization framework for hyperspectral endmember extraction. IEEE Transactions on Geoscience and Remote Sensing, 49(11):4177–4193, 2011.
[8] T.-H. Chan, W.-K. Ma, C.-Y. Chi, and Y. Wang. A convex analysis framework for blind separation of non-negative sources. IEEE Transactions on Signal Processing, 56(10):5120–5134, 2008.
