A Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization: Convergence Analysis and Optimality

Songtao Lu; Mingyi Hong; and Zhengdao Wang

arXiv:1703.08267·math.OC·March 27, 2017·IEEE Trans. Signal Process.

TL;DR

This paper introduces a novel nonconvex splitting algorithm for symmetric nonnegative matrix factorization, guaranteeing convergence to KKT points, with proven convergence rates and conditions for optimality, applicable to data clustering and segmentation.

Contribution

It presents a new nonconvex splitting method for SymNMF with convergence guarantees, parallel implementation, and conditions for global and local optimality.

Findings

01. Numerical results suggest that the algorithm converges quickly to a local minimum.

02. Convergence to the set of KKT points is guaranteed, with a global sublinear rate.

03. The method is effective on both synthetic and real data sets.

Abstract

Symmetric nonnegative matrix factorization (SymNMF) has important applications in data analytics problems such as document clustering, community detection and image segmentation. In this paper, we propose a novel nonconvex variable splitting method for solving SymNMF. The proposed algorithm is guaranteed to converge to the set of Karush-Kuhn-Tucker (KKT) points of the nonconvex SymNMF problem. Furthermore, it achieves a global sublinear convergence rate. We also show that the algorithm can be efficiently implemented in parallel. Further, sufficient conditions are provided which guarantee the global and local optimality of the obtained solutions. Extensive numerical results performed on both synthetic and real data sets suggest that the proposed algorithm converges quickly to a local minimum solution.

Tables

Table I: Local Optimality

$N$	$\lambda_{\min}(\mathbf{T})$	$\delta$	Local Optimality (true)
50	$2.71 \times 10^{- 4}$	0.42	100%
100	$4.16 \times 10^{- 4}$	0.37	100%
500	$1.8 \times 10^{- 2}$	0.91	100%

Table II: Mean and Standard Deviation of $\|\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}-\mathbf{Z}\|^{2}_{F}/\|\mathbf{Z}\|^{2}_{F}$ of the Final Solution of Each Algorithm Based on Random Initializations

Dense Data Sets	$N$	$K$	NS-SymNMF	PGD [22]	PNewton [22]	ANLS [11]	SNMF [10]	CD [23]
Reuters [49]	4,633	25	2.65e-3 $\pm$ 3.31e-10	1.14e-2 $\pm$ 1.18e-5	2.98e-3 $\pm$ 3.71e-6	1.16e-2 $\pm$ 1.61e-5	9.32e-3	2.66e-3 $\pm$ 2.04e-8
TDT2 [49]	8,939	25	1.01e-2 $\pm$ 5.35e-9	1.74e-2 $\pm$ 7.34e-6	-	2.25e-2 $\pm$ 1.25e-6	3.29e-2	1.01e-2 $\pm$ 1.21e-6

Table III: Mean and Standard Deviation of $\|\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}-\mathbf{Z}\|^{2}_{F}/\|\mathbf{Z}\|^{2}_{F}$ of the Final Solution of Each Algorithm Based on Random Initializations

Sparse Data Sets	$N$	$K$	#nonzero	NS-SymNMF	ANLS [11]	SNMF [10]	CD [23]
email-Enron [50]	36,692	50	367,662	8.05e-1 $\pm$ 4.66e-4	9.18e-1 $\pm$ 6.20e-3	9.69e-1	8.13e-1 $\pm$ 1.47e-3
loc-Brightkite [51]	58,228	50	428,156	8.75e-1 $\pm$ 9.52e-4	9.33e-1 $\pm$ 1.93e-3	9.43e-1	8.84e-1 $\pm$ 1.49e-3

Equations

$$\min_{\mathbf{X}\geq 0}\; f(\mathbf{X})=\frac{1}{2}\|\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}^{2}. \tag{1}$$

$$\min_{\mathbf{Y}\geq 0,\,\mathbf{X}=\mathbf{Y}}\; \frac{1}{2}\|\mathbf{X}\mathbf{Y}^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}^{2}. \tag{2}$$

$$\mathcal{L}(\mathbf{X},\mathbf{Y};\mathbf{\Lambda})=\frac{1}{2}\|\mathbf{X}\mathbf{Y}^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}^{2}+\langle\mathbf{Y}-\mathbf{X},\mathbf{\Lambda}\rangle+\frac{\rho}{2}\|\mathbf{Y}-\mathbf{X}\|_{F}^{2} \tag{3}$$

$$\min_{\mathbf{X},\mathbf{Y}}\; \frac{1}{2}\|\mathbf{X}\mathbf{Y}^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}^{2}\quad\text{s.t.}\quad\mathbf{Y}\geq 0,\ \mathbf{X}=\mathbf{Y},\ \|\mathbf{Y}_{i}\|_{2}^{2}\leq\tau,\ \forall i, \tag{4}$$

$$2\Big(\mathbf{X}^{*}(\mathbf{X}^{*})^{\scriptscriptstyle T}-\frac{\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z}}{2}\Big)\mathbf{X}^{*}-\mathbf{\Omega}^{*}=0, \tag{5a}$$

$$\mathbf{\Omega}^{*}\geq 0, \tag{5b}$$

$$\mathbf{X}^{*}\geq 0, \tag{5c}$$

$$\mathbf{X}^{*}\circ\mathbf{\Omega}^{*}=0 \tag{5d}$$

$$\Big\langle\Big(\mathbf{X}^{*}(\mathbf{X}^{*})^{\scriptscriptstyle T}-\frac{\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z}}{2}\Big)\mathbf{X}^{*},\,\mathbf{X}-\mathbf{X}^{*}\Big\rangle\geq 0,\quad\forall\,\mathbf{X}\geq 0. \tag{6}$$

$$\theta_{k}\triangleq\frac{\mathbf{Z}_{k,k}+\frac{1}{2}\sum_{i=1}^{N}(\mathbf{Z}_{i,k}+\mathbf{Z}_{k,i})^{2}}{2}, \tag{7}$$

$\mathbf{Y}^{(t+1)}=$

$\mathbf{X}^{(t+1)}=$

$\mathbf{\Lambda}^{(t+1)}=$

$\beta^{(t+1)}=$

$$\rho>6N\tau. \tag{12}$$

$$\lim_{t\to\infty}\|\mathbf{X}^{(t)}-\mathbf{Y}^{(t)}\|_{F}^{2}=0. \tag{13}$$

$$\widetilde{\nabla}\mathcal{L}(\mathbf{X},\mathbf{Y},\mathbf{\Lambda})\triangleq\begin{bmatrix}\mathbf{Y}^{\scriptscriptstyle T}-\textsf{proj}_{\mathcal{Y}}\big[\mathbf{Y}^{\scriptscriptstyle T}-\nabla_{\mathbf{Y}}\mathcal{L}(\mathbf{X},\mathbf{Y},\mathbf{\Lambda})\big]\\ \nabla_{\mathbf{X}}\mathcal{L}(\mathbf{X},\mathbf{Y},\mathbf{\Lambda})\end{bmatrix}$$

$$\textsf{proj}_{\mathcal{Y}}(\mathbf{W})\triangleq\arg\min_{\mathbf{Y}\geq 0,\ \|\mathbf{Y}_{i}\|_{2}^{2}\leq\tau,\ \forall i}\|\mathbf{W}-\mathbf{Y}\|_{F}^{2}$$

$$\mathcal{P}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})\triangleq\|\widetilde{\nabla}\mathcal{L}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})\|_{F}^{2}+\|\mathbf{X}^{(t)}-\mathbf{Y}^{(t)}\|_{F}^{2}.$$

$$T(\epsilon)\triangleq\min\{t\mid\mathcal{P}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})\leq\epsilon,\ t\geq 0\}.$$

$$\epsilon\leq\frac{C\,\mathcal{L}(\mathbf{X}^{(1)},\mathbf{Y}^{(1)},\mathbf{\Lambda}^{(1)})}{T(\epsilon)}.$$

$$\mathbf{S}\triangleq\mathbf{X}^{*}(\mathbf{X}^{*})^{\scriptscriptstyle T}-\frac{\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z}}{2}\succeq 0. \tag{17}$$

$$\mathbf{T}_{m,n}\triangleq\big((\mathbf{X}^{\prime*}_{m})^{\scriptscriptstyle T}\mathbf{X}^{\prime*}_{n}-\delta\|\mathbf{X}^{\prime*}_{n}\|_{2}^{2}\big)\mathbf{I}+\mathbf{X}^{\prime*}_{n}(\mathbf{X}^{\prime*}_{m})^{\scriptscriptstyle T}+\delta_{m,n}\mathbf{S}, \tag{18}$$

$$f(\mathbf{X})\geq f(\mathbf{X}^{*})+\frac{\gamma}{2}\|\mathbf{X}-\mathbf{X}^{*}\|_{F}^{2}. \tag{19}$$

$$\gamma=-\Big(\frac{2K^{2}}{\delta}+K(K-2)\Big)\epsilon^{2}+2\lambda_{\min}(\mathbf{T})>0 \tag{20}$$

$$\mathbf{T}_{1}\triangleq(1-\delta)\|\mathbf{x}^{*}\|_{2}^{2}\mathbf{I}+2\mathbf{x}^{*}(\mathbf{x}^{*})^{\scriptscriptstyle T}-\frac{\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z}}{2}\succ 0, \tag{21}$$

$$\min_{\mathbf{X}}\;\|\mathbf{Z}_{\mathbf{X}}^{(t+1)}-\mathbf{X}\mathbf{A}_{\mathbf{X}}^{(t+1)}\|_{F}^{2} \tag{22}$$

$$\mathbf{Z}_{\mathbf{X}}^{(t+1)}\triangleq\mathbf{Z}\mathbf{Y}^{(t+1)}+\mathbf{\Lambda}^{(t)}+\rho\mathbf{Y}^{(t+1)},\qquad\mathbf{A}_{\mathbf{X}}^{(t+1)}\triangleq(\mathbf{Y}^{(t+1)})^{\scriptscriptstyle T}\mathbf{Y}^{(t+1)}+\rho\mathbf{I}\succ 0 \tag{23}$$

$$\mathbf{X}^{(t+1)}=\mathbf{Z}_{\mathbf{X}}^{(t+1)}(\mathbf{A}_{\mathbf{X}}^{(t+1)})^{-1}. \tag{24}$$

$$\mathbf{Y}_{i}^{(r+1)}=\textsf{proj}_{\mathcal{Y}}\big(\mathbf{Y}_{i}^{(r)}-\alpha(\mathbf{A}_{\mathbf{Y}}^{(t)}\mathbf{Y}_{i}^{(r)}-\mathbf{Z}_{\mathbf{Y},i}^{(t)})\big) \tag{25}$$

$\mathbf{Z}_{\mathbf{Y}}^{(t)}$

Full text

A Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization:

Convergence Analysis and Optimality

Songtao Lu, Student Member, IEEE, Mingyi Hong, Member, IEEE, and Zhengdao Wang, Fellow, IEEE

Manuscript received May 15, 2016; revised October 6, 2016, January 6, 2017, and February 16, 2017; accepted February 20, 2017. The associate editor coordinating the review of this manuscript and approving it for publication was Marco Moretti. Part of the paper was presented at the 42nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, March 5–9, 2017. This work was supported in part by NSF under Grants No. 1523374 and No. 1526078, and by AFOSR under Grant No. 15RT0767. Songtao Lu and Zhengdao Wang are with the Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA (emails: {songtao, zhengdao}@iastate.edu). Mingyi Hong is with the Department of Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, IA 50011, USA (email: [email protected]).

Index Terms:

Symmetric nonnegative matrix factorization, Karush-Kuhn-Tucker points, variable splitting, global and local optimality, clustering

I Introduction

Nonnegative matrix factorization (NMF) refers to factoring a given matrix into the product of two matrices whose entries are all nonnegative. It has long been recognized as an important matrix decomposition problem [1, 2]. The requirement that the factors are component-wise nonnegative makes NMF distinct from traditional methods such as the principal component analysis (PCA) and the linear discriminant analysis (LDA), leading to many interesting applications in imaging, signal processing and machine learning [3, 4, 5, 6, 7]; see [8] for a recent survey. When further requiring that the two factors are identical after transposition, NMF becomes the so-called symmetric nonnegative matrix factorization (SymNMF). In the case where the given matrix cannot be factorized exactly, an approximate solution with a suitably defined approximation error is desired. Mathematically, SymNMF approximates a given (usually symmetric) nonnegative matrix $\mathbf{Z}\in\mathbb{R}^{N\times N}$ by a low rank matrix $\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}$ , where the factor matrix $\mathbf{X}\in\mathbb{R}^{N\times K}$ is component-wise nonnegative, typically with $K\ll N$ . Let $\|\cdot\|_{F}$ denote the Frobenius norm. The problem can be formulated as a nonconvex optimization problem [9, 10, 11]:

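$$\min_{\mathbf{X}\geq 0}\; f(\mathbf{X})=\frac{1}{2}\|\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}^{2}. \tag{1}$$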
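As an illustration of the formulation, the objective of problem (1), its gradient, and the normalized error reported in the tables can be sketched in NumPy. The function names are hypothetical and introduced here only for illustration; the gradient expression $2(\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}-(\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z})/2)\mathbf{X}$ follows from differentiating $f$ and does not assume that $\mathbf{Z}$ is symmetric.

```python
import numpy as np

def symnmf_objective(X, Z):
    """f(X) = 0.5 * ||X X^T - Z||_F^2 (hypothetical helper name)."""
    R = X @ X.T - Z
    return 0.5 * np.sum(R * R)

def symnmf_gradient(X, Z):
    """Gradient of f with respect to X: 2 (X X^T - (Z + Z^T)/2) X."""
    return 2.0 * (X @ X.T - 0.5 * (Z + Z.T)) @ X

def relative_error(X, Z):
    """Normalized fit ||X X^T - Z||_F^2 / ||Z||_F^2."""
    return np.sum((X @ X.T - Z) ** 2) / np.sum(Z ** 2)
```

A finite-difference comparison on a small random instance is a quick way to sanity-check the gradient formula.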

Recently, SymNMF has found many applications in document clustering, community detection, image segmentation and pattern clustering in bioinformatics [11, 12, 9]. An important class of clustering methods is known as spectral clustering, e.g., [13, 14], which is based on the eigenvalue decomposition of some transformed graph Laplacian matrix. In [15], it has been shown that spectral clustering and SymNMF are two different ways of relaxing kernel $K$-means clustering, where the former relaxes the nonnegativity constraint while the latter relaxes a certain orthogonality constraint. SymNMF also has the advantage of often yielding more meaningful and interpretable results [11].

I-A Related Work

Due to the importance of the NMF problem, many algorithms have been proposed in the literature for finding its high-quality solutions. Well-known algorithms include the multiplicative update [6], alternating projected gradient methods [16], alternating nonnegative least squares (ANLS) with the active set method [17] and a few recent methods such as the bilinear generalized approximate message passing [18, 19], as well as methods based on the block coordinate descent [20]. These methods often possess strong convergence guarantees (to Karush-Kuhn-Tucker (KKT) points of the NMF problem) and most of them lead to satisfactory performance in practice; see [8] and the references therein for detailed comparison and comments for different algorithms. Unfortunately, most of the aforementioned methods for NMF lack effective mechanisms to enforce the symmetry between the resulting factors, therefore they are not directly applicable to SymNMF. Recently, there have been works focusing on customized algorithms for SymNMF, which we review below.

To this end, first rewrite SymNMF equivalently as

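$$\min_{\mathbf{Y}\geq 0,\,\mathbf{X}=\mathbf{Y}}\; \frac{1}{2}\|\mathbf{X}\mathbf{Y}^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}^{2}. \tag{2}$$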

A simple strategy is to ignore the equality constraint $\mathbf{X}=\mathbf{Y}$, and then alternately perform the following two steps: 1) solving for $\mathbf{Y}$ with $\mathbf{X}$ fixed (a nonnegative least squares problem); 2) solving for $\mathbf{X}$ with $\mathbf{Y}$ fixed (a least squares problem). Such an ANLS algorithm has been proposed in [11] for dealing with SymNMF. Unfortunately, despite the fact that an optimal solution can be obtained in each subproblem, there is no guarantee that the $\mathbf{Y}$-iterate will converge to the $\mathbf{X}$-iterate. The algorithm in [11] adds a regularization term for the difference between the two factors to the objective function and explicitly enforces that the two matrices are equal at the output. Such an extra step enforces symmetry, but unfortunately also leads to the loss of global convergence guarantees. A related ANLS-based method has been introduced in [10]; however, the algorithm is based on the assumption that there exists an exact symmetric factorization (i.e., $\exists\,\mathbf{X}\geq 0$ such that $\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}=\mathbf{Z}$). Without such an assumption, the algorithm may not converge to the set of KKT points$^{1}$ of problem (1). A multiplicative update for SymNMF has been proposed in [9], but the algorithm lacks convergence guarantees (to KKT points of problem (1)) [21], and has a much slower convergence speed than the one proposed in [10]. In [11, 22], algorithms based on the projected gradient descent (PGD) and the projected Newton (PNewton) have been proposed, both of which directly solve the original formulation (1). Again, there has been no global convergence analysis since the objective function is a nonconvex fourth-order polynomial. More recently, the work [23] applies the nonconvex coordinate descent (CD) algorithm for SymNMF. Due to the fact that the minimizer of the fourth-order polynomial is not unique in each coordinate update, the CD-based method may not converge to stationary points.

Footnote 1: Let $d(a,s)$ denote the distance between two points $a$ and $s$. We say that a sequence $a_{i}$ converges to a set $\mathcal{S}$ if the distance between $a_{i}$ and $\mathcal{S}$, defined as $\inf_{s\in\mathcal{S}}d(a_{i},s)$, converges to zero as $i\to\infty$.

Another popular method for NMF is based on the alternating direction method of multipliers (ADMM), which is a flexible tool for large scale convex optimization [24]. For example, using ADMM for both NMF and matrix completion, high quality results have been obtained in [25] for gray-scale and hyperspectral image recovery. Furthermore, ADMM has been applied to generalized versions of NMF where the objective function is the general beta-divergence [26]. A hybrid alternating optimization and ADMM method was proposed for NMF, as well as tensor factorization, under a variety of constraints and loss measures in [27]. However, despite the promising numerical results, none of the works discussed above has rigorous theoretical justification for SymNMF. Recently, the work [28] has applied ADMM to NMF and provided one of the first analyses of ADMM for nonconvex matrix-factorization type problems. However, it is important to note that the algorithm in [28] does not apply to the SymNMF case, because our problem is more restrictive in that symmetric factors are desired, while in NMF symmetry is not enforced. Technically, imposing symmetry makes the analysis considerably more difficult (we will comment on this point shortly). In fact, the convergence of ADMM for SymNMF is still open in the literature.

An important research question for NMF and SymNMF is whether it is possible to design algorithms that lead to globally optimal solutions. At first sight such a problem appears very challenging, since finding the exact NMF is NP-hard [29] and checking whether a positive semidefinite matrix can be decomposed exactly by SymNMF is also NP-hard [30]. However, some promising recent findings suggest that when the structure of the underlying factors is appropriately utilized, it is possible to obtain rather strong results. For example, in [31], the authors have shown that for the low rank factorized stochastic optimization problem where the two low rank matrices are symmetric, a modified stochastic gradient descent algorithm is capable of converging to a global optimum with constant probability from a random starting point. Related works also include [32, 33, 34]. However, when the factors are required to be nonnegative and symmetric, it is no longer clear whether the existing analysis can still be used to show convergence to global/local optimal points. For the nonnegative principal component problem (i.e., finding the leading nonnegative eigenvector) under the spiked model, reference [35] shows that a certain approximate message passing algorithm is able to find the global optimal solution asymptotically. Unfortunately, this analysis does not generalize to an arbitrary symmetric observation matrix for the case $K>1$. To the best of our knowledge, a characterization of global and local optimal solutions for SymNMF is still lacking.

I-B Contributions

In this paper, we first propose a novel algorithm for SymNMF, which utilizes nonconvex splitting and is capable of converging to the set of KKT points with a provable global convergence rate. The main idea is to relax the symmetry requirement at the beginning and gradually enforce it as the algorithm proceeds. Second, we provide a number of easy-to-check sufficient conditions guaranteeing the local or global optimality of the obtained solutions. Numerical results on both synthetic and real data show that the proposed algorithm achieves fast and stable convergence (often to local minimum solutions) with low computational complexity.

More specifically, the main contributions of this paper are:

1) We design a novel nonconvex splitting SymNMF (NS-SymNMF) algorithm, which converges to the set of KKT points of SymNMF with a global sublinear rate. To the best of our knowledge, it is the first SymNMF solver that possesses global convergence rate guarantees.
2) We provide a set of easily checkable sufficient conditions (which only involve finding the smallest eigenvalue of a certain matrix) that characterize the global and local optimality of the solutions. By utilizing such conditions, we demonstrate numerically that with high probability, our proposed algorithm converges not only to the set of KKT points but to a local optimal solution as well.

Notation: Bold upper case letters without subscripts (e.g., $\mathbf{X},\mathbf{Y}$ ) denote matrices and bold lower case letters without subscripts (e.g., $\mathbf{x},\mathbf{y}$ ) represent vectors. The notation $\mathbf{Z}_{i,j}$ denotes the $(i,j)$ -th entry of matrix $\mathbf{Z}$ . Vector $\mathbf{X}_{i}$ denotes the $i$ th row of matrix $\mathbf{X}$ and $\mathbf{X}^{\prime}_{m}$ denotes the $m$ th column of the matrix.

II The Proposed Algorithm

The proposed algorithm leverages the reformulation (2). Our main idea is to gradually tighten the difficult equality constraint $\mathbf{X}=\mathbf{Y}$ as the algorithm proceeds so that when convergence is approached, such equality is eventually satisfied. To this end, let us construct the augmented Lagrangian for (2), given by

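$$\mathcal{L}(\mathbf{X},\mathbf{Y};\mathbf{\Lambda})=\frac{1}{2}\|\mathbf{X}\mathbf{Y}^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}^{2}+\langle\mathbf{Y}-\mathbf{X},\mathbf{\Lambda}\rangle+\frac{\rho}{2}\|\mathbf{Y}-\mathbf{X}\|_{F}^{2} \tag{3}$$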

where $\mathbf{\Lambda}\in\mathbb{R}^{N\times K}$ is a matrix of dual variables, $\langle\cdot\rangle$ denotes the inner product operator, and $\rho>0$ is a penalty parameter whose value will be determined later.
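A minimal NumPy sketch of the augmented Lagrangian (3) may help make its three terms concrete. The function name and argument order are assumptions made here for illustration; the inner product is taken as the entrywise (Frobenius) inner product.

```python
import numpy as np

def augmented_lagrangian(X, Y, Lam, Z, rho):
    """Evaluate L(X, Y; Lam) from (3); the helper name is hypothetical."""
    fit = 0.5 * np.linalg.norm(X @ Y.T - Z, "fro") ** 2       # data-fitting term
    coupling = np.sum((Y - X) * Lam)                           # <Y - X, Lam>
    penalty = 0.5 * rho * np.linalg.norm(Y - X, "fro") ** 2    # quadratic penalty
    return fit + coupling + penalty
```

When $\mathbf{X}=\mathbf{Y}$, the last two terms vanish and the value reduces to the SymNMF objective $f(\mathbf{X})$.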

It may be tempting to directly apply the well-known ADMM method to the augmented Lagrangian (3), which alternately minimizes over the primal variables $\mathbf{X}$ and $\mathbf{Y}$, followed by a dual ascent step $\mathbf{\Lambda}\leftarrow\mathbf{\Lambda}+\rho(\mathbf{Y}-\mathbf{X})$. Unfortunately, the classical results for ADMM presented in [24, 36, 37] only hold for convex problems, hence they do not apply to our nonconvex problem (2) (note that this is a linearly constrained nonconvex problem where the nonconvexity arises in the objective function). Recent results such as [38, 39, 40, 41] that analyze ADMM for nonconvex problems do not apply either, because these works require that: 1) the objective function is separable over the block variables; 2) the smooth part of the augmented Lagrangian has Lipschitz continuous gradient with respect to all variable blocks. Unfortunately, neither of these conditions is satisfied in our problem.

Next we begin presenting the proposed algorithm. We start by considering the following reformulation of problem (1)

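$$\min_{\mathbf{X},\mathbf{Y}}\; \frac{1}{2}\|\mathbf{X}\mathbf{Y}^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}^{2}\quad\text{s.t.}\quad\mathbf{Y}\geq 0,\ \mathbf{X}=\mathbf{Y},\ \|\mathbf{Y}_{i}\|_{2}^{2}\leq\tau,\ \forall i, \tag{4}$$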

where $\tau>0$ is some given constant.

Let $\mathbf{\Omega}^{*}$ denote the dual matrix for the constraint $\mathbf{X}\geq 0$ in the Lagrangian of problem (1). The KKT conditions of problem (1) are given by [42, eq. (5.49)]

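$$\begin{aligned}2\Big(\mathbf{X}^{*}(\mathbf{X}^{*})^{\scriptscriptstyle T}-\frac{\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z}}{2}\Big)\mathbf{X}^{*}-\mathbf{\Omega}^{*}&=0, &&\text{(5a)}\\ \mathbf{\Omega}^{*}&\geq 0, &&\text{(5b)}\\ \mathbf{X}^{*}&\geq 0, &&\text{(5c)}\\ \mathbf{X}^{*}\circ\mathbf{\Omega}^{*}&=0 &&\text{(5d)}\end{aligned}$$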

where $\circ$ denotes the Hadamard product. For a point $\mathbf{X}^{*}$ , if we can find some $\mathbf{\Omega}^{*}$ such that $(\mathbf{X}^{*},\mathbf{\Omega}^{*})$ satisfies conditions (5a)–(5d), then we term $\mathbf{X}^{*}$ a KKT point of problem (1).
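The KKT conditions can be checked numerically for a candidate point by solving (5a) for $\mathbf{\Omega}^{*}$ and measuring how far (5b)–(5d) are from holding. The sketch below is only an illustration; the function name and the choice of residual measures are assumptions, not part of the paper.

```python
import numpy as np

def kkt_residuals(X, Z):
    """Residuals of (5b)-(5d) with Omega defined through (5a).

    Hypothetical helper: all three values are zero at an exact KKT point.
    """
    Omega = 2.0 * (X @ X.T - 0.5 * (Z + Z.T)) @ X         # (5a) solved for Omega
    return {
        "dual_infeasibility": max(0.0, -Omega.min()),      # violation of Omega >= 0
        "primal_infeasibility": max(0.0, -X.min()),        # violation of X >= 0
        "complementarity": np.abs(X * Omega).max(),        # violation of X o Omega = 0
    }
```

For instance, $\mathbf{X}=\mathbf{0}$ gives $\mathbf{\Omega}=\mathbf{0}$, so all residuals vanish even though that point is generally not a minimizer.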

A stationary point for problem (1) is a point $\mathbf{X}^{*}$ that satisfies the following optimality condition [43, Proposition 2.1.2]:

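$$\Big\langle\Big(\mathbf{X}^{*}(\mathbf{X}^{*})^{\scriptscriptstyle T}-\frac{\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z}}{2}\Big)\mathbf{X}^{*},\,\mathbf{X}-\mathbf{X}^{*}\Big\rangle\geq 0,\quad\forall\,\mathbf{X}\geq 0. \tag{6}$$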

It can be checked that when $\tau$ in (4) is sufficiently large (larger than a threshold dependent on $\mathbf{Z}$ ), then problem (4) is equivalent to problem (1), in the sense that the KKT points $\mathbf{X}^{*}$ of the two problems are identical. Also, there is a one-to-one correspondence between the KKT points and stationary points of the SymNMF problem, although in general such one-to-one correspondence may not hold. To be more precise, we have:

Lemma 1.

For problem (1), a point $\mathbf{X}^{*}$ is a KKT point, meaning that there exists some $\mathbf{\Omega}^{*}$ such that $(\mathbf{X}^{*},\mathbf{\Omega}^{*})$ satisfies (5a)–(5d), if and only if $\mathbf{X}^{*}$ is a stationary point, meaning that it satisfies (6).

Proof:

See Section VII-A ∎

Lemma 2.

Suppose $\tau>\theta_{k},\forall k$ where

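$$\theta_{k}\triangleq\frac{\mathbf{Z}_{k,k}+\frac{1}{2}\sum_{i=1}^{N}(\mathbf{Z}_{i,k}+\mathbf{Z}_{k,i})^{2}}{2}, \tag{7}$$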

then the KKT points of problem (1) and the KKT points of problem (4) have a one-to-one correspondence.

Proof:

See Section VII-B. ∎

We remark that the previous work [23] has made the observation that solving SymNMF with the additional constraints $\|\mathbf{X}_{i}\|_{2}\leq\sqrt{2\|\mathbf{Z}\|_{F}},\forall i$ will not result in any loss of the global optimality. Lemma 2 provides a stronger result, that all KKT points of SymNMF are preserved within a smaller bounded feasible set $\mathcal{Y}\triangleq\{\mathbf{Y}\mid\mathbf{Y}_{i}\geq 0,\|\mathbf{Y}_{i}\|^{2}_{2}\leq\tau,\forall i\}$ (note, that $\tau\ll 2\|\mathbf{Z}\|_{F}$ in general).

The proposed NS-SymNMF algorithm alternates between the primal updates of variables $\mathbf{X}$ and $\mathbf{Y}$ , and the dual update for $\mathbf{\Lambda}$ . Below we present its detailed steps (superscript $t$ is used to denote the iteration number).

[TABLE]

We remark that this algorithm is very close in form to the standard ADMM method applied to problem (4) (which lacks convergence guarantees). The key difference is the use of the proximal term $\|\mathbf{Y}-\mathbf{Y}^{(t)}\|_{F}^{2}$ multiplied by an iteration-dependent penalty parameter $\beta^{(t)}\geq 0$, whose value is proportional to the size of the objective value. Intuitively, if the algorithm converges to a solution with a small objective value, then the parameter $\beta^{(t)}$ vanishes in the limit. Introducing such a proximal term is one of the main novelties of the algorithm, and it is crucial in guaranteeing the convergence of NS-SymNMF.
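The structure of one iteration (a proximal $\mathbf{Y}$-step over the bounded nonnegative set, a closed-form $\mathbf{X}$-step, a dual ascent step, and a refresh of $\beta$) can be sketched as follows. This is a rough sketch under stated assumptions, not the authors' implementation: the proximal term is assumed to enter as $\frac{\beta^{(t)}}{2}\|\mathbf{Y}-\mathbf{Y}^{(t)}\|_{F}^{2}$, the rule $\beta=\frac{6}{\rho}\|\mathbf{X}\mathbf{Y}^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}^{2}$ is an assumed instance of "proportional to the objective value", and the names `project_rows` and `ns_symnmf` as well as the inner-loop length are hypothetical.

```python
import numpy as np

def project_rows(W, tau):
    """Project each row of W onto {y >= 0, ||y||_2^2 <= tau}."""
    P = np.maximum(W, 0.0)
    norms = np.linalg.norm(P, axis=1, keepdims=True)
    scale = np.minimum(1.0, np.sqrt(tau) / np.maximum(norms, 1e-300))
    return P * scale

def ns_symnmf(Z, K, rho, tau, iters=200, inner=20, seed=0):
    """Sketch of the splitting iteration; beta rule and 1/2 factor are assumptions."""
    rng = np.random.default_rng(seed)
    N = Z.shape[0]
    X = project_rows(rng.random((N, K)), tau)
    Y = X.copy()
    Lam = np.zeros((N, K))
    I = np.eye(K)
    beta = (6.0 / rho) * np.linalg.norm(X @ Y.T - Z, "fro") ** 2   # assumed rule
    for _ in range(iters):
        # Y-step: projected gradient on the strongly convex quadratic
        # 0.5||X Y^T - Z||^2 + <Y - X, Lam> + rho/2 ||Y - X||^2 + beta/2 ||Y - Y_prev||^2,
        # whose gradient in Y is Y @ A - B with A and B as below.
        A = X.T @ X + (rho + beta) * I
        B = Z.T @ X + rho * X - Lam + beta * Y      # Y here is the previous iterate
        step = 1.0 / np.linalg.eigvalsh(A)[-1]
        for _ in range(inner):
            Y = project_rows(Y - step * (Y @ A - B), tau)
        # X-step: unconstrained least squares, solved in closed form
        A_X = Y.T @ Y + rho * I
        Z_X = Z @ Y + Lam + rho * Y
        X = np.linalg.solve(A_X, Z_X.T).T
        # Dual ascent step and proximal weight refresh
        Lam = Lam + rho * (Y - X)
        beta = (6.0 / rho) * np.linalg.norm(X @ Y.T - Z, "fro") ** 2
    return X, Y, Lam
```

By construction the returned $\mathbf{Y}$ is always feasible for the bounded nonnegative set, while $\mathbf{X}$ is unconstrained and is only driven toward $\mathbf{Y}$ through the penalty and the dual variable.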

III Convergence Analysis

In this section we provide convergence analysis of NS-SymNMF for a general SymNMF problem. We do not require $\mathbf{Z}$ to be symmetric, positive-semidefinite, or to have positive entries. We assume $K$ can be any integer in $[1,\;N]$ .

III-A Convergence and Convergence Rate

Below we present our first main result, which asserts that when the penalty parameter $\rho$ is sufficiently large, the NS-SymNMF algorithm converges globally to the set of KKT points of problem (1).

Theorem 1.

Suppose the following is satisfied

$$\rho>6N\tau. \tag{12}$$

Then the following statements are true for NS-SymNMF:

1. The equality constraint is satisfied in the limit, i.e.,

$$\lim_{t\to\infty}\|\mathbf{X}^{(t)}-\mathbf{Y}^{(t)}\|_{F}^{2}=0. \tag{13}$$

2. The sequence $\{\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)}\}$ generated by the algorithm is bounded, and every limit point of the sequence is a KKT point of problem (1).

An equivalent statement on the convergence is that the sequence $\{\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)}\}$ converges to the set of KKT points of problem (1); cf. footnote 1.

Proof:

See Section VII-C. ∎

Our second result characterizes the convergence rate of the algorithm. To this end, we construct a function that measures the optimality of the iterates $\{\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)}\}$ . Define the proximal gradient of the augmented Lagrangian function as

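$$\widetilde{\nabla}\mathcal{L}(\mathbf{X},\mathbf{Y},\mathbf{\Lambda})\triangleq\begin{bmatrix}\mathbf{Y}^{\scriptscriptstyle T}-\textsf{proj}_{\mathcal{Y}}\big[\mathbf{Y}^{\scriptscriptstyle T}-\nabla_{\mathbf{Y}}\mathcal{L}(\mathbf{X},\mathbf{Y},\mathbf{\Lambda})\big]\\ \nabla_{\mathbf{X}}\mathcal{L}(\mathbf{X},\mathbf{Y},\mathbf{\Lambda})\end{bmatrix}$$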

where

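$$\textsf{proj}_{\mathcal{Y}}(\mathbf{W})\triangleq\arg\min_{\mathbf{Y}\geq 0,\ \|\mathbf{Y}_{i}\|_{2}^{2}\leq\tau,\ \forall i}\|\mathbf{W}-\mathbf{Y}\|_{F}^{2}$$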

i.e., it is the projection operator that projects a given matrix $\mathbf{W}$ onto the feasible set of $\mathbf{Y}$ . Here we propose to use the following quantity to measure the progress of the algorithm

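$$\mathcal{P}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})\triangleq\|\widetilde{\nabla}\mathcal{L}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})\|_{F}^{2}+\|\mathbf{X}^{(t)}-\mathbf{Y}^{(t)}\|_{F}^{2}.$$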

It can be verified that if $\lim_{t\to\infty}\mathcal{P}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})=0$ , then a KKT point of problem (1) is obtained.

Below we show that the function $\mathcal{P}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})$ goes to zero in a sublinear manner.

Theorem 2.

For a given small constant $\epsilon$ , let $T(\epsilon)$ denote the iteration index satisfying the following inequality

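$$T(\epsilon)\triangleq\min\{t\mid\mathcal{P}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})\leq\epsilon,\ t\geq 0\}.$$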

Then there exists some constant $C>0$ such that

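$$\epsilon\leq\frac{C\,\mathcal{L}(\mathbf{X}^{(1)},\mathbf{Y}^{(1)},\mathbf{\Lambda}^{(1)})}{T(\epsilon)}.$$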

Proof:

See Section VII-D. ∎

The result indicates that it takes $\mathcal{O}(1/\epsilon)$ iterations for $\mathcal{P}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})$ to be less than $\epsilon$ . It follows that NS-SymNMF converges sublinearly.
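The $\mathcal{O}(1/\epsilon)$ count follows by rearranging the bound in Theorem 2. As a short worked step, assuming $\epsilon>0$ and $T(\epsilon)\geq 1$, and writing $\mathcal{L}^{(1)}$ as shorthand (introduced here) for the augmented Lagrangian at the first iterate:

```latex
\epsilon \le \frac{C\,\mathcal{L}^{(1)}}{T(\epsilon)}
\quad\Longrightarrow\quad
T(\epsilon) \le \frac{C\,\mathcal{L}^{(1)}}{\epsilon}
= \mathcal{O}\!\left(\frac{1}{\epsilon}\right),
\qquad
\mathcal{L}^{(1)} \triangleq \mathcal{L}(\mathbf{X}^{(1)},\mathbf{Y}^{(1)},\mathbf{\Lambda}^{(1)}).
```

In particular, halving the target accuracy $\epsilon$ at most doubles this upper bound on the iteration count.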

III-B Sufficient Global and Local Optimality Conditions

Since problem (1) is not convex, the KKT points obtained by NS-SymNMF could be different from the global optimal solutions. Therefore it is important to characterize the conditions under which these two different types of solutions coincide. Below we provide an easily checkable sufficient condition to ensure that a KKT point $\mathbf{X}^{*}$ is also a globally optimal solution for problem (1).

Theorem 3.

Suppose that $\mathbf{X}^{*}$ is a KKT point of problem (1). Then, $\mathbf{X}^{*}$ is also a global optimal point if the following is satisfied

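$$\mathbf{S}\triangleq\mathbf{X}^{*}(\mathbf{X}^{*})^{\scriptscriptstyle T}-\frac{\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z}}{2}\succeq 0. \tag{17}$$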

Proof:

See Section VII-E. ∎
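Condition (17) can be tested with a single symmetric eigenvalue computation. The sketch below is illustrative only: the function name and the tolerance `tol` are assumptions, and the check says nothing unless the input is already known to be a KKT point, as Theorem 3 requires.

```python
import numpy as np

def global_optimality_certificate(X, Z, tol=1e-10):
    """Check S = X X^T - (Z + Z^T)/2 >= 0 (PSD) up to a tolerance.

    Hypothetical helper; returns (is_psd, smallest eigenvalue of S).
    """
    S = X @ X.T - 0.5 * (Z + Z.T)
    lam_min = np.linalg.eigvalsh(0.5 * (S + S.T))[0]   # eigvalsh sorts ascending
    return lam_min >= -tol, lam_min
```

For an exact factorization $\mathbf{Z}=\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}$ the matrix $\mathbf{S}$ is zero and the test passes, whereas $\mathbf{X}=\mathbf{0}$ with $\mathbf{Z}=\mathbf{I}$ gives $\mathbf{S}=-\mathbf{I}$ and the test fails.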

It is important to note that condition (17) is only a sufficient condition, and it may be difficult to satisfy in practice. Below we provide a milder condition which ensures that a KKT point is locally optimal. This type of result is also very useful in practice since it can help identify spurious saddle points such as the point $\mathbf{X}^{*}=\mathbf{0}$ in the case where $\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z}$ is not negative semidefinite.

We have the following characterization of the local optimal solution of the SymNMF problem.

Theorem 4.

Suppose that $\mathbf{X}^{*}$ is a KKT point of problem (1). Define a block matrix $\mathbf{T}\in\mathbb{R}^{KN\times KN}$ whose $(m,n)$ th block is a matrix of size $N\times N$ as follows

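$$\mathbf{T}_{m,n}\triangleq\big((\mathbf{X}^{\prime*}_{m})^{\scriptscriptstyle T}\mathbf{X}^{\prime*}_{n}-\delta\|\mathbf{X}^{\prime*}_{n}\|_{2}^{2}\big)\mathbf{I}+\mathbf{X}^{\prime*}_{n}(\mathbf{X}^{\prime*}_{m})^{\scriptscriptstyle T}+\delta_{m,n}\mathbf{S}, \tag{18}$$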

where $\mathbf{S}$ is defined in (17), $\delta_{m,n}$ is the Kronecker delta function, and $\mathbf{X}^{\prime*}_{m}$ denotes the $m$ th column of $\mathbf{X}^{*}$ . If there exists some $\delta>0$ such that $\mathbf{T}\succ 0$ , then $\mathbf{X}^{*}$ is a strict local minimum solution of problem (1), meaning that there exists some $\epsilon>0$ small enough such that for all $\mathbf{X}\geq 0$ satisfying $\|\mathbf{X}-\mathbf{X}^{*}\|_{F}\leq\epsilon$ , we have

$$f(\mathbf{X})\geq f(\mathbf{X}^{*})+\frac{\gamma}{2}\|\mathbf{X}-\mathbf{X}^{*}\|_{F}^{2}. \tag{19}$$

Here the constant $\gamma$ is given by

$$\gamma=-\Big(\frac{2K^{2}}{\delta}+K(K-2)\Big)\epsilon^{2}+2\lambda_{\min}(\mathbf{T})>0 \tag{20}$$

where $\lambda_{\min}(\mathbf{T})>0$ is the smallest eigenvalue of $\mathbf{T}$ .

Proof:

See Section VII-F. ∎
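The block matrix in Theorem 4 can be assembled directly from the columns of a candidate KKT point. The following sketch builds each block from the expression for $\mathbf{T}_{m,n}$ as printed in the theorem; the function names are hypothetical, and testing positive definiteness through the symmetric part of $\mathbf{T}$ is an assumption made here, since the printed blocks need not yield a symmetric matrix.

```python
import numpy as np

def build_T(X, Z, delta):
    """Assemble the KN x KN block matrix T from Theorem 4 (hypothetical helper)."""
    N, K = X.shape
    S = X @ X.T - 0.5 * (Z + Z.T)
    I = np.eye(N)
    T = np.zeros((K * N, K * N))
    for m in range(K):
        xm = X[:, m]
        for n in range(K):
            xn = X[:, n]
            block = (xm @ xn - delta * (xn @ xn)) * I + np.outer(xn, xm)
            if m == n:                       # Kronecker delta term
                block = block + S
            T[m * N:(m + 1) * N, n * N:(n + 1) * N] = block
    return T

def local_optimality_certificate(X, Z, delta):
    """Return (T > 0 in the quadratic-form sense, smallest eigenvalue of sym(T))."""
    T = build_T(X, Z, delta)
    lam_min = np.linalg.eigvalsh(0.5 * (T + T.T))[0]
    return lam_min > 0, lam_min
```

In practice one would call `local_optimality_certificate` for several values of `delta` and accept the point as a strict local minimum if any of them returns a positive smallest eigenvalue.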

In the special case of $K=1$ , the sufficient condition set forth in Theorem 4 can be significantly simplified.

Corollary 1.

Suppose that $\mathbf{x}^{*}$ is the KKT point of problem (1) when $K=1$ . If there exists some $\delta>0$ such that

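$$\mathbf{T}_{1}\triangleq(1-\delta)\|\mathbf{x}^{*}\|_{2}^{2}\mathbf{I}+2\mathbf{x}^{*}(\mathbf{x}^{*})^{\scriptscriptstyle T}-\frac{\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z}}{2}\succ 0, \tag{21}$$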

then $\mathbf{x}^{*}$ is a strict local minimum point of problem (1).

Proof:

See Section VII-G. ∎

We comment that the condition given in Theorem 4 is much milder than that in Theorem 3. Further, such a condition is also very easy to check, as it only involves finding the smallest eigenvalue of a $KN\times KN$ matrix for a given $\delta$ (see footnote 2). In our numerical results (to be presented shortly), we test a series of consecutive values of $\delta$. We have observed that the solutions generated by NS-SymNMF satisfy the condition provided in Theorem 4 with high probability.

Footnote 2: To find such a smallest eigenvalue, we can find the largest eigenvalue of $\eta\mathbf{I}-\mathbf{T}$, using algorithms such as the power method [14], where $\eta$ is sufficiently large based on $\tau$ and $\|\mathbf{Z}\|_{F}$.
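The shifted power iteration mentioned in footnote 2 can be sketched as below. This is an illustration under assumptions: the shift $\eta$ is taken here as the Frobenius norm of the symmetrized matrix (a simple upper bound on its spectral radius) rather than a bound derived from $\tau$ and $\|\mathbf{Z}\|_{F}$, and the function name and iteration count are hypothetical.

```python
import numpy as np

def smallest_eigenvalue_power(T, iters=2000, seed=0):
    """Estimate lambda_min of sym(T) via power iteration on eta*I - sym(T)."""
    T = 0.5 * (T + T.T)
    eta = np.linalg.norm(T, "fro")            # assumed shift: bounds every |eigenvalue|
    M = eta * np.eye(T.shape[0]) - T          # PSD; top eigenvalue is eta - lambda_min(T)
    v = np.random.default_rng(seed).standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = M @ v
        nw = np.linalg.norm(w)
        if nw == 0.0:                         # M is the zero matrix
            break
        v = w / nw
    mu = v @ M @ v                            # Rayleigh quotient of M at v
    return eta - mu
```

Convergence slows down when the two smallest eigenvalues of $\mathbf{T}$ are close, so a fixed iteration count like the one above is only a placeholder.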

IV Implementation

In this section we discuss the implementation of the proposed algorithm.

IV-A The $\mathbf{X}$ -Subproblem

The subproblem for updating $\mathbf{X}^{(t+1)}$ in (9) is equivalent to the following problem

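$$\min_{\mathbf{X}}\;\|\mathbf{Z}_{\mathbf{X}}^{(t+1)}-\mathbf{X}\mathbf{A}_{\mathbf{X}}^{(t+1)}\|_{F}^{2} \tag{22}$$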

where

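$$\mathbf{Z}_{\mathbf{X}}^{(t+1)}\triangleq\mathbf{Z}\mathbf{Y}^{(t+1)}+\mathbf{\Lambda}^{(t)}+\rho\mathbf{Y}^{(t+1)},\qquad\mathbf{A}_{\mathbf{X}}^{(t+1)}\triangleq(\mathbf{Y}^{(t+1)})^{\scriptscriptstyle T}\mathbf{Y}^{(t+1)}+\rho\mathbf{I}\succ 0 \tag{23}$$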

are two fixed matrices. Clearly problem (22) is just a least squares problem and can be solved in closed-form. The solution is given by

$$\mathbf{X}^{(t+1)}=\mathbf{Z}_{\mathbf{X}}^{(t+1)}(\mathbf{A}_{\mathbf{X}}^{(t+1)})^{-1}. \tag{24}$$

We remark that $\mathbf{A}^{(t+1)}_{\mathbf{X}}$ is a $K\times K$ matrix, where $K$ is usually small (e.g., the number of clusters for graph clustering applications). As a result, $\mathbf{X}^{(t+1)}$ in (24) can be obtained by solving a small system of linear equations and is hence computationally cheap.
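The closed-form update (24) can be sketched in a few lines. The function name is hypothetical; the sketch solves the $K\times K$ linear system instead of forming the inverse explicitly, which is the usual numerically preferable choice.

```python
import numpy as np

def x_update(Z, Y, Lam, rho):
    """X-step per (22)-(24): X = Z_X A_X^{-1}, computed via a K x K solve."""
    K = Y.shape[1]
    Z_X = Z @ Y + Lam + rho * Y               # N x K right-hand side, see (23)
    A_X = Y.T @ Y + rho * np.eye(K)           # K x K, symmetric positive definite
    # X A_X = Z_X  <=>  A_X X^T = Z_X^T because A_X is symmetric
    return np.linalg.solve(A_X, Z_X.T).T
```

At the returned point the gradient of the augmented Lagrangian (3) with respect to $\mathbf{X}$, namely $(\mathbf{X}\mathbf{Y}^{\scriptscriptstyle T}-\mathbf{Z})\mathbf{Y}-\mathbf{\Lambda}-\rho(\mathbf{Y}-\mathbf{X})$, is zero.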

IV-B The $\mathbf{Y}$ -Subproblem

The $\mathbf{Y}$ -subproblem (8) can be decomposed into $N$ separable constrained least squares problems, each of which can be solved independently, and hence can be implemented in parallel. We may use the conventional gradient projection (GP) for solving each subproblem, using iterations

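$$\mathbf{Y}_{i}^{(r+1)}=\textsf{proj}_{\mathcal{Y}}\big(\mathbf{Y}_{i}^{(r)}-\alpha(\mathbf{A}_{\mathbf{Y}}^{(t)}\mathbf{Y}_{i}^{(r)}-\mathbf{Z}_{\mathbf{Y},i}^{(t)})\big) \tag{25}$$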

where

[TABLE]

$\mathbf{Z}_{\mathbf{Y},i}$ denotes the $i$ th column of matrix $\mathbf{Z}_{\mathbf{Y}}$ , $\alpha$ is the step size, which is chosen either as a constant $1/\lambda_{\max}(\mathbf{A}^{(t)}_{\mathbf{Y}})$ , or by using some line search procedure [43]; $r$ denotes the iteration of the inner loop; for a given vector $\mathbf{w}$ , $\textsf{proj}_{\mathcal{Y}}(\mathbf{w})$ denotes the projection of it to the feasible set of $\mathbf{Y}_{i}$ , which can be evaluated in closed-form [44, pp. 80] as follows

[TABLE]

Other algorithms, such as accelerated versions of gradient projection [45], can also be used to solve the $\mathbf{Y}$-subproblem. It is also worth noting that when $\mathbf{Z}$ is sparse, the complexity of computing $\mathbf{Z}\mathbf{Y}^{(t+1)}$ in (23) and $(\mathbf{X}^{(t)})^{\scriptscriptstyle T}\mathbf{Z}$ in (26) is only proportional to the number of nonzero entries of $\mathbf{Z}$.
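The gradient projection iteration for a single row subproblem can be sketched as follows. The sketch takes the quadratic's matrix `A` and vector `z` as inputs rather than constructing $\mathbf{A}_{\mathbf{Y}}$ and $\mathbf{Z}_{\mathbf{Y}}$, and it uses the standard fact that projecting onto the intersection of the nonnegative orthant and an origin-centered ball amounts to clipping negatives and then rescaling. The function names and the fixed iteration count are assumptions.

```python
import numpy as np

def proj_row(w, tau):
    """Project a vector onto {y >= 0, ||y||_2^2 <= tau}: clip, then rescale."""
    p = np.maximum(w, 0.0)
    nrm = np.linalg.norm(p)
    r = np.sqrt(tau)
    return p if nrm <= r else p * (r / nrm)

def gp_row(A, z, y0, tau, iters=100):
    """Gradient projection for min 0.5 y^T A y - z^T y over the feasible set.

    Uses the constant step size 1 / lambda_max(A); A is assumed symmetric
    positive definite.
    """
    alpha = 1.0 / np.linalg.eigvalsh(A)[-1]
    y = proj_row(y0, tau)
    for _ in range(iters):
        y = proj_row(y - alpha * (A @ y - z), tau)
    return y
```

Since the $N$ row subproblems share the same `A` and differ only in `z`, they can be dispatched independently, for example one batch of rows per worker.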

V Numerical Results

In this section, we compare the proposed algorithm with a few existing SymNMF solvers on both synthetic and real data sets. We run each algorithm with 20 random initializations (except for SNMF, which does not require external initialization). The entries of the initialized $\mathbf{X}$ (or $\mathbf{Y}$ ) follow an i.i.d. uniform distribution in the range $[0,\tau]$ . All algorithms are started with the same initial point each time, and all tests are performed using Matlab on a computer with Intel Core i5-5300U CPU running at 2.30GHz with 8GB RAM. Since the compared algorithms have different computational complexity, we use the objective values versus CPU time for fair comparison. We next describe different SymNMF solvers that are compared in our work.

Algorithms Comparison. In our numerical simulations, we compare the following algorithms.

Projected Gradient Descent (PGD) and Projected Newton Method (PNewton) [22, 11]

PGD and PNewton directly use the gradient of the objective function. The key difference between them is that PGD adopts the identity matrix as a scaling matrix, while PNewton exploits the reduced Hessian to accelerate convergence. The PGD algorithm converges slowly if the step size is not well selected, while the PNewton algorithm has high per-iteration complexity compared with ANLS and NS-SymNMF, due to the need to compute the Hessian matrix. Note that, to the best of our knowledge, neither PGD nor PNewton possesses convergence or rate of convergence guarantees.

Alternating Nonnegative Least Squares (ANLS) [11]

The ANLS method is a very competitive SymNMF solver, which can be implemented in parallel easily. ANLS reformulates SymNMF as

[TABLE]

where $\nu>0$ is the regularization parameter. One shortcoming is that there is no theoretical guarantee that the ANLS method converges to the set of KKT points of problem (1), or even that it produces two symmetric factors, although a penalty term for the difference between the factors ($\mathbf{X}$ and $\mathbf{Y}$) is included in the objective.

Symmetric Nonnegative Matrix Factorization (SNMF) [10]

The SNMF algorithm transforms the original problem into another one under the assumption that $\mathbf{Z}$ can be exactly decomposed as $\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}$. Although SNMF often converges quickly in practice, there has been no theoretical analysis for the general case where $\mathbf{Z}$ cannot be exactly decomposed.

Coordinate Descent (CD) [23]

The CD method updates each entry of $\mathbf{X}$ in a cyclic way. To update each entry, we only need to find the roots of a fourth-order univariate function. However, CD may not converge to the set of KKT points of SymNMF. Instead, an additional condition is given in [23] for checking whether the generated sequence converges to a unique limit point. A heuristic method for checking the condition is also provided, which requires, e.g., plotting the norm of the difference between iterates.

The Proposed NS-SymNMF

The update rule of NS-SymNMF is similar to that of ANLS. The difference between them is that NS-SymNMF uses one additional block for dual variables and ANLS adds a penalty term. The dual update involved in NS-SymNMF benefits the convergence of the algorithm to KKT points of SymNMF.

We remark that in the implementation of NS-SymNMF we let $\tau=\max_{k}\theta_{k}$ (cf. (7)) and set the maximum number of GP iterations to $40$. Also, we gradually increase the value of $\rho$ from an initial value until condition (12) is met, which accelerates convergence [46]. Here, the choice of $\rho$ follows $\rho^{(t+1)}=\min\{\rho^{(t)}/(1-\epsilon/\rho^{(t)}),6.1N\tau\}$, where $\epsilon=10^{-3}$ as suggested in [47]. We choose $\rho^{(1)}=\bar{\tau}$ for the case where $\mathbf{Z}$ can be exactly decomposed and $\rho^{(1)}=\sqrt{N}\bar{\tau}$ for the remaining cases, where $\bar{\tau}$ is the mean of $\theta_{k},\forall k$. A similar strategy is applied for updating $\beta^{(t)}$: we choose $\beta^{(t)}=6\xi^{(t)}\|\mathbf{X}^{(t)}(\mathbf{Y}^{(t)})^{\scriptscriptstyle T}-\mathbf{Z}\|^{2}_{F}/\rho^{(t)}$, where $\xi^{(t+1)}=\min\{\xi^{(t)}/(1-\epsilon/\xi^{(t)}),1\}$ and $\xi^{(1)}=0.01$, and we only update $\beta^{(t)}$ once every 100 iterations to save CPU time. To update $\mathbf{Y}$, we implement the block pivoting method [17], since this method is faster than the GP method for solving the nonnegative least squares problem. If $\|\mathbf{Y}^{(t+1)}_{i}\|^{2}_{2}\leq\tau$ is not satisfied, then we switch to GP on $\mathbf{Y}^{(t)}_{i}$. We also remark that we set the step size of PGD to $10^{-5}$ for all tested cases, and use the Matlab codes of PNewton and ANLS from http://math.ucla.edu/~dakuang/.
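The capped update used for both $\rho^{(t)}$ and $\xi^{(t)}$ can be sketched as below. This is only an illustration of the recursion quoted above; the helper names `capped_update` and `rho_schedule` are hypothetical, and the sketch assumes the current value is larger than $\epsilon$ so that the denominator stays positive.

```python
def capped_update(value, cap, eps=1e-3):
    """One step of v <- min{ v / (1 - eps / v), cap }; assumes value > eps > 0."""
    return min(value / (1.0 - eps / value), cap)

def rho_schedule(rho_init, n, tau, num_iters, eps=1e-3):
    """Penalty values rho^(1), ..., rho^(num_iters + 1), capped at 6.1 * N * tau."""
    cap = 6.1 * n * tau
    rhos = [rho_init]
    for _ in range(num_iters):
        rhos.append(capped_update(rhos[-1], cap, eps))
    return rhos

# Toy usage with made-up sizes: N = 10, tau = 1, starting from rho = 1.
rhos = rho_schedule(1.0, 10, 1.0, 5)

# The same helper applies to xi, which starts at 0.01 and is capped at 1.
xi_next = capped_update(0.01, 1.0)
```

Each step increases the value by slightly more than $\epsilon$ until the cap is reached, after which the sequence stays constant.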

Performance on Synthetic Data. First we describe the two synthetic data sets that we have used in the first part of the numerical results.

Data set I (Random symmetric matrices): We randomly generate two types of symmetric matrices, one is of low rank and the other is of full rank.

For the low rank matrix, we first generate a matrix $\mathbf{M}$ with dimension $N\times K$, whose entries follow an i.i.d. Gaussian distribution with zero mean and unit variance. We use $\mathbf{M}_{i,j}$ to denote the $(i,j)$th entry of $\mathbf{M}$. Then we generate a new matrix $\widetilde{\mathbf{M}}$ whose $(i,j)$th entry is $|\mathbf{M}_{i,j}|$. Finally, we obtain a positive symmetric $\mathbf{Z}=\widetilde{\mathbf{M}}\widetilde{\mathbf{M}}^{\scriptscriptstyle T}$ as the given matrix to be decomposed.

For the full rank matrix, we first randomly generate an $N\times N$ matrix $\mathbf{P}$, whose entries follow an i.i.d. uniform distribution in the interval $[0,1]$. Then we compute $\mathbf{Z}=(\mathbf{P}+\mathbf{P}^{\scriptscriptstyle T})/2$.
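The two constructions of data set I can be sketched in a few lines. This is a small pure-Python illustration rather than the Matlab code used for the experiments; the function names and the toy sizes are assumptions, and `random.random()` draws from $[0,1)$ rather than the closed interval.

```python
import random

def low_rank_matrix(n, k, rng):
    """Z = |M| |M|^T, where M is n-by-k with i.i.d. standard Gaussian entries."""
    m_abs = [[abs(rng.gauss(0.0, 1.0)) for _ in range(k)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(m_abs[i], m_abs[j])) for j in range(n)]
            for i in range(n)]

def full_rank_matrix(n, rng):
    """Z = (P + P^T) / 2, where P is n-by-n with i.i.d. uniform entries."""
    p = [[rng.random() for _ in range(n)] for _ in range(n)]
    return [[(p[i][j] + p[j][i]) / 2.0 for j in range(n)] for i in range(n)]

rng = random.Random(0)                 # fixed seed, chosen only for repeatability
z_low = low_rank_matrix(6, 2, rng)     # admits an exact factorization with K = 2
z_full = full_rank_matrix(6, rng)
```

By construction the first matrix has an exact nonnegative factor $\widetilde{\mathbf{M}}$, which is the setting of Figure 1(a), while the second generally does not.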

Data set II (Adjacency matrices): One important application of SymNMF is graph partitioning, where the adjacency matrix of a graph is factorized. We randomly generate a graph as follows. First, we set the number of nodes to $N$, the number of clusters to $4$, and the numbers of nodes within the clusters to $300,500,800,400$. Second, we randomly generate data points whose pairwise distances are used to construct the adjacency matrix. Specifically, data points $x_{i}\in\mathbb{R}$, $i=1,\ldots,N$, are generated in one dimension. Within each cluster, the data points follow an i.i.d. Gaussian distribution. The means of the random variables in the 4 clusters are $2,3,6,8$, respectively, and the variance is 0.5 for all distributions. We then construct the similarity matrix $\mathbf{A}\in\mathbb{R}^{N\times N}$, whose $(i,j)$th entry is $\mathbf{A}_{i,j}=\exp(-(x_{i}-x_{j})^{2}/(2\sigma^{2}))$, where $\sigma^{2}=0.5$.
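A minimal sketch of this construction is given below. The function names are hypothetical, and the demo uses cluster sizes $3,5,8,4$ (the same $3:5:8:4$ ratio, scaled down) so that the dense matrix stays small in pure Python.

```python
import math
import random

def clustered_points(sizes, means, variance, rng):
    """One-dimensional points; cluster c holds sizes[c] Gaussian samples."""
    std = math.sqrt(variance)
    return [rng.gauss(mu, std) for size, mu in zip(sizes, means)
            for _ in range(size)]

def gaussian_similarity(points, sigma2=0.5):
    """A[i][j] = exp(-(x_i - x_j)^2 / (2 * sigma^2))."""
    return [[math.exp(-((xi - xj) ** 2) / (2.0 * sigma2)) for xj in points]
            for xi in points]

rng = random.Random(0)  # fixed seed, chosen only for repeatability
points = clustered_points((3, 5, 8, 4), (2.0, 3.0, 6.0, 8.0), 0.5, rng)
A = gaussian_similarity(points)
```

Points in the same cluster tend to be close, so their similarity is near one, while points from distant clusters (for example those with means 2 and 8) produce entries near zero.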

The convergence behaviors of different SymNMF solvers for the synthetic data sets are shown in Figure 1 and Figure 2. The results are averaged over 20 Monte Carlo (MC) trials with independently generated data. In Figure 1(a), the generated $\mathbf{Z}$ can be exactly decomposed by SymNMF. It can be observed that NS-SymNMF and SNMF converge to the global optimal solution quickly, and SNMF is the fastest among all compared algorithms. However, the case where the matrix can be exactly factorized is not common in most practical applications. Hence, we also consider the case where the matrix $\mathbf{Z}$ cannot be factorized exactly by an $N\times K$ matrix. The results are shown in Figure 1(b), where we use the relative objective value for comparison, i.e., $\|\mathbf{X}\mathbf{X}^{\scriptscriptstyle T}-\mathbf{Z}\|^{2}_{F}/\|\mathbf{Z}\|^{2}_{F}$. We can observe that NS-SymNMF and CD achieve a lower objective value than the other methods. It is worth noting that there is a gap between SNMF and the others, since the assumption of SNMF is not satisfied in this case.
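For concreteness, the relative objective value used in this comparison can be computed as in the following sketch. The function name is hypothetical and the matrices are stored as lists of rows; $\mathbf{X}\mathbf{X}^{T}$ is formed entry by entry to avoid storing a second $N\times N$ matrix.

```python
def relative_objective(X, Z):
    """Return ||X X^T - Z||_F^2 / ||Z||_F^2 for an N-by-K factor X."""
    n = len(Z)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            xxt_ij = sum(a * b for a, b in zip(X[i], X[j]))  # (X X^T)[i][j]
            num += (xxt_ij - Z[i][j]) ** 2
            den += Z[i][j] ** 2
    return num / den

X = [[1.0, 0.0], [0.0, 2.0]]
Z = [[1.0, 0.0], [0.0, 4.0]]   # equals X X^T, so the relative objective is zero
exact_value = relative_objective(X, Z)
```

A value of zero corresponds to an exact factorization, and the all-zero factor gives a value of one, which makes curves for matrices of different sizes comparable.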

We also run the algorithms on the adjacency matrices (data set II); the results are shown in Figure 2. The NS-SymNMF and SNMF algorithms converge very fast, but it can be observed that there is still a gap between SNMF and NS-SymNMF, as shown in Figure 2(a). We further show the convergence rates with respect to the optimality gap versus CPU time in Figure 2(b). The optimality gap (14) measures the closeness between the generated sequence and the true stationary point. To remove the effect of the dimension of $\mathbf{Z}$, we use $\|\mathbf{X}-\textsf{proj}_{+}[\mathbf{X}-\nabla_{\mathbf{X}}(f(\mathbf{X}))]\|_{\infty}$ as the optimality gap. It is interesting to see the “swamp” effect [48], where the objective value generated by the CD algorithm remains almost constant from around 25s to 75s, although the corresponding iterates do not actually converge, and then the objective value starts decreasing again.
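The projected-gradient residual above can be evaluated as in the sketch below. It rests on two assumptions that are not spelled out in this section: that $f(\mathbf{X})=\frac{1}{2}\|\mathbf{X}\mathbf{X}^{T}-\mathbf{Z}\|^{2}_{F}$ with symmetric $\mathbf{Z}$, so that $\nabla f(\mathbf{X})=2(\mathbf{X}\mathbf{X}^{T}-\mathbf{Z})\mathbf{X}$, and that $\|\cdot\|_{\infty}$ means the largest absolute entry. The function name is hypothetical.

```python
def optimality_gap(X, Z):
    """Largest entry of |X - max(X - grad f(X), 0)|.

    Assumes f(X) = 0.5 * ||X X^T - Z||_F^2 with symmetric Z, which gives
    grad f(X) = 2 (X X^T - Z) X.
    """
    n, k = len(X), len(X[0])
    # Residual R = X X^T - Z.
    R = [[sum(a * b for a, b in zip(X[i], X[j])) - Z[i][j] for j in range(n)]
         for i in range(n)]
    grad = [[2.0 * sum(R[i][l] * X[l][m] for l in range(n)) for m in range(k)]
            for i in range(n)]
    return max(abs(X[i][m] - max(X[i][m] - grad[i][m], 0.0))
               for i in range(n) for m in range(k))

gap_at_solution = optimality_gap([[1.0], [2.0]], [[1.0, 2.0], [2.0, 4.0]])
gap_elsewhere = optimality_gap([[1.0]], [[0.0]])
```

The residual vanishes exactly at nonnegative points where the gradient is nonnegative and complementary to $\mathbf{X}$, so it is a natural proxy for distance to a KKT point.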

Checking Global/Local Optimality. After the NS-SymNMF algorithm has converged, the local/global optimality can be checked according to Theorem 3 and Theorem 4. To find an appropriate $\delta$ satisfying the condition $\lambda_{\min}(\mathbf{T})>0$, we initialize $\delta$ as 1, decrease it by $0.01$ each time, and check the minimum eigenvalue of $\mathbf{T}$. Here, we use data set II with a fixed ratio of the numbers of nodes within each cluster (i.e., $3:5:8:4$) and test on different total numbers of nodes. The simulation results are shown in Table I with 100 MC trials, where the average values of $\lambda_{\min}(\mathbf{T})$ and $\delta$ are given. Further, the percentage of trials in which a valid $\delta>0$ ensuring $\lambda_{\min}(\mathbf{T})>0$ can be found is listed in the last column. We note that there always exists a $\delta$ such that $\mathbf{T}$ is positive definite in all cases that we tested. This indicates that (with high probability) the proposed algorithm converges to a locally optimal solution. In Figure 3, we provide the values of $\delta$ that make the corresponding $\lambda_{\min}(\mathbf{T})>0$ at each realization.
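The search over $\delta$ described above amounts to a simple loop, sketched below. The callable `lambda_min_of_T` is a hypothetical placeholder that must build $\mathbf{T}$ for a given $\delta$ (as defined in Theorem 4) and return its smallest eigenvalue; the toy function used in the demo is not the true $\mathbf{T}$ and only exercises the loop.

```python
def find_valid_delta(lambda_min_of_T, start=1.0, step=0.01):
    """Decrease delta from `start` in steps of `step`, keeping delta > 0.

    Returns (delta, lambda_min) for the first delta with lambda_min > 0,
    or None if no such delta is found on the grid.
    """
    num_steps = int(round(start / step))
    for s in range(num_steps):
        delta = start - s * step
        lam = lambda_min_of_T(delta)
        if lam > 0.0:
            return delta, lam
    return None

# Toy stand-in for lambda_min(T(delta)); purely illustrative.
found = find_valid_delta(lambda d: 0.505 - d)
not_found = find_valid_delta(lambda d: -1.0)
```

Returning `None` when the grid is exhausted makes it easy to tally the fraction of trials in which a valid $\delta$ exists, which is the quantity reported in the last column of Table I.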

We also remark that in practice we stop the algorithm in finite steps, so only an approximate KKT point will be obtained, and the degree of such approximation can be measured by the optimality gap defined in (14).

Performance on Real Data. We also implement the algorithm on a few real data sets in clustering applications, which will be described in the next paragraphs.

V-1 Dense Similarity Matrix

We generate the dense similarity matrices based on two real data sets: Reuters-21578 and TDT2 [49]. We use the 10th subset of the processed Reuters-21578 data set, which includes $N=4,633$ documents divided into $K=25$ classes. The number of features is 18,933. The Topic Detection and Tracking 2 (TDT2) corpus includes two newswires (APW and NYT), two radio programs (VOA and PRI) and two television programs (CNN and ABC). We use the 10th subset of the processed TDT2 data set with $K=25$ classes, which includes $N=8,939$ documents, each of which has 36,771 features. We comment that the 10th TDT2 subset is the largest among all TDT2 and Reuters subsets. Any other subset can be used equally well. The similarity matrix is constructed by the Gaussian function, where the difference between two documents is measured over all features using the Euclidean distance [49].

The means and standard deviations of the objective values of the final solutions are shown in Table II. Convergence results of the algorithms are shown in Figure 4. For the Reuters and TDT2 data sets, before SNMF completes the eigenvalue decomposition for the first iteration, CD and NS-SymNMF have already obtained low objective values. Also, since calculating the Hessian in PNewton is time consuming, the result of PNewton is out of range in Figure 4(b).

V-2 Sparse Similarity Matrix

We also generate multiple convergence curves for each algorithm with random initializations based on some sparse real data sets.

Email-Enron network data set [50]: The Enron email corpus includes around half a million emails. We use the relationships between pairs of email addresses to construct the similarity matrix to be decomposed. If an address $i$ sent at least one email to address $j$, then we take $\mathbf{A}_{i,j}=\mathbf{A}_{j,i}=1$. Otherwise, we set $\mathbf{A}_{i,j}=\mathbf{A}_{j,i}=0$.

Brightkite data set [51]: Brightkite was a location-based social networking website. Users were able to share their current locations by checking in. The friendships of the users were maintained by Brightkite. The similarity matrix is constructed in the same way as for the Enron email data set.
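The rule used for both sparse data sets (an undirected edge whenever at least one message or friendship links two nodes) can be sketched as follows. The sketch stores only the nonzero entries as neighbor sets; the function names and the integer node labels are assumptions made for illustration.

```python
def symmetric_adjacency(num_nodes, directed_pairs):
    """Neighbor sets of the 0/1 matrix with A[i][j] = A[j][i] = 1 for each pair."""
    neighbors = [set() for _ in range(num_nodes)]
    for i, j in directed_pairs:
        neighbors[i].add(j)
        neighbors[j].add(i)
    return neighbors

def count_nonzeros(neighbors):
    """Number of nonzero entries of the symmetric 0/1 matrix."""
    return sum(len(s) for s in neighbors)

# Toy usage: (0, 1) and (1, 0) describe the same undirected link.
pairs = [(0, 1), (1, 0), (2, 0)]
adj = symmetric_adjacency(3, pairs)
nnz = count_nonzeros(adj)
```

Keeping only the nonzero entries matters here because the cost of the matrix products in the updates scales with the number of nonzeros rather than with $N^{2}$.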

The means and standard deviations of the objective values of the final solutions are shown in Table III. From the simulation results shown in Figure 5, it can be observed that the NS-SymNMF algorithm converges faster than CD, while SNMF and ANLS converge to some points where the relative objective values are higher than the one obtained by NS-SymNMF.

VI Conclusions

In this paper, we propose a nonconvex splitting algorithm for solving the SymNMF problem. We show that the proposed algorithm converges to a KKT point in a sublinear manner. Further, we provide sufficient conditions to identify global or local optimal solutions of the SymNMF problem. Numerical experiments show that the proposed method can converge quickly to local optimal solutions.

In the future, we plan to extend the proposed methods in a way such that the algorithms can converge to the local or even global optimal solutions of SymNMF without requiring checking conditions. Also, it is possible to apply the nonconvex splitting method to more general matrix factorization problems, such as the quadratic nonnegative matrix factorization problem [52].

VII Appendix

VII-A Proof of Lemma 1

Sufficiency: the stationary points satisfy

[TABLE]

Let $\mathbf{\Omega}\triangleq(\mathbf{X}^{*}(\mathbf{X}^{*})^{\scriptscriptstyle T}-(\mathbf{Z}^{\scriptscriptstyle T}+\mathbf{Z})/2)\mathbf{X}^{*}/2$ . We have $\langle\mathbf{\Omega},\mathbf{X}-\mathbf{X}^{*}\rangle\geq 0,\forall\mathbf{X}\geq 0$ . By setting $\mathbf{X}$ appropriately as $0\leq\mathbf{X}\leq\mathbf{X}^{*}$ , we have $\mathbf{\Omega}_{i,j}\geq 0,(i,j)\in\mathcal{S}$ where $\mathcal{S}=\{i,j|\mathbf{X}^{*}_{i,j}\neq 0\}$ . Also, by setting $\mathbf{X}$ appropriately as $\mathbf{X}\geq\mathbf{X}^{*}$ , we have $\mathbf{\Omega}_{i,j}\geq 0,(i,j)\notin\mathcal{S}$ . Combining the two cases, we conclude that $\mathbf{\Omega}\geq 0$ .

From (30), we know that $\langle\mathbf{\Omega},\mathbf{X}\rangle\geq\langle\mathbf{\Omega},\mathbf{X}^{*}\rangle$ . Since $\mathbf{\Omega}\geq 0$ and $\mathbf{X}\geq 0$ , we have $\langle\mathbf{\Omega},\mathbf{X}\rangle\geq 0,\forall\mathbf{X}$ , meaning that $\langle\mathbf{\Omega},\mathbf{X}^{*}\rangle\leq 0$ . Combining with $\mathbf{X}^{*}\geq 0$ and $\mathbf{\Omega}\geq 0$ , we have $\langle\mathbf{\Omega},\mathbf{X}^{*}\rangle\geq 0$ , which results in $\langle\mathbf{\Omega},\mathbf{X}^{*}\rangle=0$ .

In summary, we have

[TABLE]

which are the KKT conditions of the SymNMF problem.

Necessity: If the point is a KKT point of SymNMF, we have

[TABLE]

Combining with $\langle\mathbf{X}^{*},\mathbf{\Omega}^{*}\rangle=0$ , we know that

[TABLE]

which is the condition of stationary points.

VII-B Proof of Lemma 2

We prove that if $\tau$ is large enough, then the KKT conditions of (1) and (4) are the same.

Proof:

It is sufficient to show that when $\tau$ is large enough, there can be no KKT point whose column has size $\tau$ , leading to the fact that the constraint $\|\mathbf{X}^{*}_{k}\|^{2}\leq\tau$ is always inactive.

We check the optimality condition of the SymNMF problem at $\|\mathbf{X}^{*}_{k}\|^{2}=\tau_{k}$ , where $\tau_{k}>0$ is a constant. We can rewrite the objective function as

[TABLE]

Note that $\mathbf{X}_{i},\mathbf{X}_{j},\mathbf{X}_{k}$ denote rows of the matrix $\mathbf{X}$.

We take the gradient of $f(\mathbf{X})$ with respect to $\mathbf{X}_{k}$:

[TABLE]

where $\mathbf{X}_{i,m}$ denotes the $m$ th entry of the $i$ th row of $\mathbf{X}$ .

Assume that $\mathbf{X}^{*}_{k}$ is a KKT point. We have $(\frac{\partial f(\mathbf{X}^{*}_{k})}{\partial\mathbf{X}_{k}})(\mathbf{X}_{k}-\mathbf{X}^{*}_{k})^{\scriptscriptstyle T}\geq 0,\forall~{}\mathbf{X}_{k}\in\mathcal{X}$ , where $\mathcal{X}=\{\mathbf{X}_{k}|\mathbf{X}_{k}\geq 0,\|\mathbf{X}_{k}\|^{2}\leq\tau_{k}\}$ , which implies

[TABLE]

Since $\|\mathbf{X}^{*}_{k}\|^{2}=\tau_{k}$ , there exists an index $m$ such that $\mathbf{X}^{*}_{k,m}>0$ . Consider a feasible point $0\leq\mathbf{X}_{k,m}<\mathbf{X}^{*}_{k,m}$ , where $m\in\mathcal{S}_{m}\triangleq\{m|\mathbf{X}^{*}_{k,m}\neq 0\}$ . Thanks to (VII-B), we have

[TABLE]

Plugging (34) into (36) and multiplying $\mathbf{X}^{*}_{k,m}$ on both sides of (36), we can obtain

[TABLE]

For the case $m\notin\mathcal{S}_{m}$ , we know that $\mathbf{X}^{*}_{k,m}=0$ . Summing up (37) $\forall m$ , and noting that $|\mathcal{S}_{m}|\geq 1$ we can get

[TABLE]

In (38), $\mathcal{M}_{i,k}$ is a quadratic function with respect to $C_{i,k}$, where $C_{i,k}\triangleq\mathbf{X}^{*}_{i}(\mathbf{X}^{*}_{k})^{\scriptscriptstyle T}$, so the minimum of $\mathcal{M}_{i,k}$ is $-1/4((\mathbf{Z}_{i,k}+\mathbf{Z}_{k,i})/2)^{2}$. Consequently, the minimum of $\sum^{N}_{i=1,i\neq k}\mathcal{M}_{i,k}$ is $-1/4\sum^{N}_{i=1,i\neq k}((\mathbf{Z}_{i,k}+\mathbf{Z}_{k,i})/2)^{2}$.

In addition, since we have $\|\mathbf{X}^{*}_{k}\|^{2}=\tau_{k}$ , the lower bound of $p$ is $p_{\textsf{L}}\triangleq-1/4\sum^{N}_{i=1,i\neq k}((\mathbf{Z}_{i,k}+\mathbf{Z}_{k,i})/2)^{2}+\tau_{k}(\tau_{k}-\mathbf{Z}_{k,k})$ which is a quadratic function in terms of $\tau_{k}$ . Therefore, if

[TABLE]

then $p\geq p_{\textsf{L}}>0$ , which contradicts the optimality condition (37). It can be concluded that whenever $\tau_{k}$ is large enough, at any KKT point no column will have size equal to $\tau_{k}$ . Furthermore, it can be easily checked that $\tau>\max_{k}\theta_{k}$ is a sufficient condition. The proof is complete. ∎

VII-C Convergence Proof of the Proposed Algorithm

In this section, we prove Theorem 1. The analysis consists of a series of lemmas.

Lemma 3.

Consider using the update rules (8) – (10) to solve problem (1). Then we have

[TABLE]

Proof:

The optimality condition of the $\mathbf{X}$ subproblem (9) is given by

[TABLE]

Substituting (10) into (41), we have

[TABLE]

Subtracting the same equation in iteration $t$, we have the successive difference of the dual matrix (VII-C).

Note that the following is true

[TABLE]

Plugging (45) into (VII-C), we have

[TABLE]

Using triangle inequality, we arrive at

[TABLE]

Since $\|\mathbf{Y}_{i}\|^{2}\leq\tau$ , we know that $\|\mathbf{Y}\|_{F}\leq\sqrt{N\tau}$ . Squaring both sides of (VII-C), we obtain

[TABLE]

The claim is proved. ∎
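The bound $\|\mathbf{Y}\|_{F}\leq\sqrt{N\tau}$ used in the proof above follows by summing the per-block constraints. Writing $\mathbf{Y}_{i}$, $i=1,\ldots,N$, for the $N$ vectors on which the constraint $\|\mathbf{Y}_{i}\|^{2}\leq\tau$ is imposed, the step can be spelled out as:

```latex
\|\mathbf{Y}\|_{F}^{2}
  = \sum_{i=1}^{N}\|\mathbf{Y}_{i}\|^{2}
  \leq \sum_{i=1}^{N}\tau
  = N\tau
\quad\Longrightarrow\quad
\|\mathbf{Y}\|_{F}\leq\sqrt{N\tau}.
```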

In the second step, we bound the successive difference of the augmented Lagrangian.

Lemma 4.

Consider using the update rules (8)–(10). If

[TABLE]

we have

[TABLE]

where $c_{1},c_{2},c_{3}>0$ are some constants.

Proof:

Let

[TABLE]

which is an upper bound of $\mathcal{L}(\mathbf{X}^{(t)},\mathbf{Y},\mathbf{\Lambda}^{(t)})$ , and

[TABLE]

We have the following descent estimate

[TABLE]

Next we bound the quantities in (52)

[TABLE]

where $(a)$ is due to the fact that Taylor expansion for quadratic problems is exact, and $(b)$ is due to the optimality condition for problem (8). Similarly, we have

[TABLE]

where $(a)$ is from (10).

Substituting the result of Lemma 3 into (54), we can obtain

[TABLE]

Therefore, from (VII-C) if $\frac{\rho}{2}-\frac{3N^{2}\tau^{2}}{\rho}>0$ , $\frac{1}{2}-\frac{3N\tau}{\rho}>0$ , and

[TABLE]

which are equivalent to

[TABLE]

then $\mathcal{L}(\mathbf{X}^{(t+1)},\mathbf{Y}^{(t+1)},\mathbf{\Lambda}^{(t+1)})-\mathcal{L}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})<0$ .

Then, it is concluded that $\mathcal{L}(\mathbf{X}^{(t+1)},\mathbf{Y}^{(t+1)},\mathbf{\Lambda}^{(t+1)})$ is decreasing. ∎
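The first two inequalities in the proof above can be reduced to explicit bounds on $\rho$ by elementary algebra, assuming $\rho,N,\tau>0$. The following worked step covers only these two inequalities; the remaining condition involves $\beta^{(t)}$ and is not reproduced here.

```latex
\frac{\rho}{2}-\frac{3N^{2}\tau^{2}}{\rho}>0
  \;\Longleftrightarrow\; \rho^{2}>6N^{2}\tau^{2}
  \;\Longleftrightarrow\; \rho>\sqrt{6}\,N\tau,
\qquad
\frac{1}{2}-\frac{3N\tau}{\rho}>0
  \;\Longleftrightarrow\; \rho>6N\tau .
```

Since $6>\sqrt{6}$, the second bound implies the first, so $\rho>6N\tau$ suffices for both. This is consistent with the cap $6.1N\tau$ used for $\rho^{(t)}$ in the numerical section.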

In the next step we prove that $\mathcal{L}(\mathbf{X}^{(t+1)},\mathbf{Y}^{(t+1)},\mathbf{\Lambda}^{(t+1)})$ is lower bounded.

Lemma 5.

Consider using the update rules (8)–(10). If $\rho\geq N\tau$ is satisfied, we have

[TABLE]

Proof:

At iteration $t+1$ , the augmented Lagrangian can be lower bounded as

[TABLE]

where $(a)$ is due to (42), and $(b)$ is true because

[TABLE]

and $\|\mathbf{Y}\|^{2}_{F}\leq N\tau$ .

From (59), we know that if $\rho\geq N\tau$ , we have $\mathcal{L}(\mathbf{X}^{(t+1)},\mathbf{Y}^{(t+1)},\mathbf{\Lambda}^{(t+1)})\geq 0$ . ∎

These lemmas lead to the main convergence claim.

Proof:

Combining (50) and (58), we have

[TABLE]

By Lemma 3, we have

[TABLE]

which implies $\lim_{t\to\infty}\|\mathbf{X}^{(t)}-\mathbf{Y}^{(t)}\|^{2}_{F}=0$. Combining with (60), we further have $\lim_{t\to\infty}\|\mathbf{Y}^{(t+1)}-\mathbf{Y}^{(t)}\|^{2}_{F}=0$. The boundedness of $\mathbf{X}^{(t)}$ then follows from the boundedness of $\mathbf{Y}^{(t)}$. Using the expression of $\mathbf{\Lambda}^{(t)}$ in (42), one can show that $\{\mathbf{\Lambda}^{(t)}\}$ is also bounded.

The optimality condition of (8) is given by

[TABLE]

Substituting (42) into (62), using (60), and taking limit over any converging subsequence of $\{\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)}\}$ , we have

[TABLE]

The optimality condition of (9) is given by

[TABLE]

Taking limit of (64) over the same subsequence, we have

[TABLE]

Using the fact $\mathbf{X}^{*}=\mathbf{Y}^{*}$ , we have

[TABLE]

which are the KKT conditions of problem (1). ∎

VII-D Convergence Rate Proof of the Proposed Algorithm

Proof:

Based on Theorem 1, $\|\mathbf{X}^{(t)}\|^{2}_{F}$ is bounded. There must exist a finite $\gamma>0$ such that $\|\mathbf{X}^{(t)}\|^{2}_{F}\leq N\gamma,\forall t$ , where $\gamma$ is only dependent on $\tau$ , $N$ and $\|\mathbf{Z}\|_{F}$ .

From the optimality condition of $\mathbf{Y}$ in (8), we have

[TABLE]

Then, we have

[TABLE]

where $\textsf{proj}_{\mathcal{Y}}$ denotes the projection of $\mathbf{Y}$ to the feasible space; in $(a)$ we used triangle inequality; $(b)$ is due to the nonexpansiveness of the projection operator; and $(c)$ is due to the boundedness of $\|\mathbf{X}\|_{F}$ .

Similarly, we can bound the size of the gradient of the augmented Lagrangian with respect to $\mathbf{X}$ by the following series of inequalities

[TABLE]

where $(a)$ is from the optimality condition of the $\mathbf{X}$ -subproblem (41); $(b)$ is true due to (43) and (42). Squaring both sides of (70) and applying Lemma 3, we have

[TABLE]

Due to the boundedness of $\mathbf{X}^{(t)}$ and $\mathbf{Y}^{(t)}$ , we must have that for some $\delta>0$ , $\|\mathbf{X}^{(t)}(\mathbf{Y}^{(t)})^{\scriptscriptstyle T}-\mathbf{Z}\|_{F}\leq\delta$ .

Therefore, combining (68) and (71), there must exist a finite positive number $\sigma_{1}$ such that

[TABLE]

where

[TABLE]

In particular, we have $\sigma_{1}\triangleq\max\{3(3N^{2}\tau^{2}+\rho^{2}),3(2+\rho+\beta^{(t)})^{2}+3(3\delta^{2}+\rho^{2}),3\gamma+9N\tau\}$ and $\beta^{(t)}\leq 6\delta^{2}/\rho$ .

According to Lemma 3, we have

[TABLE]

for the constant $\sigma_{2}\triangleq\max\{3N^{2}\tau^{2}/\rho^{2},3\delta^{2}/\rho^{2},3N\tau/\rho^{2}\}$.

Also, we have

[TABLE]

which yields

[TABLE]

for $\sigma_{3}\triangleq\max\{9N^{2}\tau^{2}/\rho^{2}+3,9\delta^{2}/\rho^{2}+3,9N\tau/\rho^{2}\}$ .

The inequalities (72) and (76) imply that

[TABLE]

According to Lemma 4, there exists a constant $\sigma_{4}\triangleq\min\{c_{1},c_{2},c_{3}\}$ such that

[TABLE]

Combining (77) and (78), we have

[TABLE]

Summing both sides of (79) over $t=1,\ldots,r$ , we have

[TABLE]

where $(a)$ is due to Lemma 5.

According to the definition of $T(\epsilon)$ and $\mathcal{P}(\mathbf{X}^{(t)},\mathbf{Y}^{(t)},\mathbf{\Lambda}^{(t)})$ , the above inequality becomes

[TABLE]

Dividing both sides by $T(\epsilon)$ , and by setting $C\triangleq(\sigma_{1}+\sigma_{3})/\sigma_{4}$ , the desired result is obtained. ∎

VII-E Sufficient Condition of Global Optimality

Proof:

Let $\mathbf{\Omega}$ be the Lagrange multipliers matrix. The Lagrangian of problem (1) is given by

[TABLE]

Let $(\mathbf{X}^{*},\mathbf{\Omega}^{*})$ be a KKT point of problem (1). To show global optimality of $(\mathbf{X}^{*},\mathbf{\Omega}^{*})$, it is sufficient to prove the following saddle point condition [42, p. 238]

[TABLE]

To show the left hand side of (83), we have the following

[TABLE]

where $(a)$ is due to (5d), and $(b)$ is due to $\mathbf{\Omega}\geq 0$ and (5c).

Next we show the right hand side of (83)

[TABLE]

where $(a)$ is due to $\mathcal{M}\geq 0$ and the fact that

[TABLE]

$(b)$ is true because of (5a). Clearly, if we have $\mathbf{S}\succeq 0$ , then the following inequality must be true

[TABLE]

This completes the proof. ∎

VII-F Sufficient Condition of Local Optimality

Proof:

We first simplify the term $\mathcal{M}$ in (VII-E) as follows.

[TABLE]

where $(a)$ is due to the fact that

[TABLE]

in $(b)$ we defined $\widehat{\mathbf{Y}}\triangleq\mathbf{X}-\mathbf{X}^{*}$ which shows the difference between $\mathbf{X}$ and $\mathbf{X}^{*}$ ; and in $(c)$ we defined $\mathbf{U}\triangleq\widehat{\mathbf{Y}}\widehat{\mathbf{Y}}^{\scriptscriptstyle T}=\mathbf{U}^{\scriptscriptstyle T}$ .

Combining (86) and (93), we have

[TABLE]

where

[TABLE]

and $\widetilde{\mathcal{K}}_{m,n}\triangleq\mathbf{X}^{\prime*}_{n}(\mathbf{X}^{\prime*}_{m})^{\scriptscriptstyle T}$ , $(m,n)$ denotes the $(m,n)$ th block of a matrix, $\mathbf{X}^{\prime*}_{m}$ ( $\widehat{\mathbf{Y}}^{\prime}_{n}$ ) denotes the $m$ th (or $n$ th) column of matrix $\mathbf{X}^{*}$ (or $\widehat{\mathbf{Y}}$ ).

For the $(m,n)$ th block, we have

[TABLE]

where

[TABLE]

$\delta_{m,n}$ is the Kronecker delta function, and $\mathbf{T}_{m,n}$ is the $(m,n)$th block of the matrix $\mathbf{T}$; in $(a)$ we use the triangle inequality, where $\delta>0$ is any positive number; and in $(b)$ we use the Cauchy-Schwarz inequality.

If there exists $\delta$ such that $\mathbf{T}$ is positive definite, then $\mathbf{X}^{*}$ is a strict local minimum point of problem (1). That is, there exist some $\gamma,\epsilon>0$ such that

[TABLE]

where $\gamma$ is given by

[TABLE]

where $\lambda_{\min}(\mathbf{T})$ is the smallest eigenvalue of matrix $\mathbf{T}$ . Clearly $\gamma$ can be made positive for sufficiently small $\epsilon$ .

According to the definition of Lagrangian (82), we have

[TABLE]

Combining with (96) and the KKT conditions (5b)–(5d), we can obtain

[TABLE]

Therefore $\mathbf{X}^{*}$ is a strict local minimum point of problem (1). ∎

VII-G Sufficient Local Optimality Condition When $K=1$

Proof:

The term $\mathcal{M}$ is as follows.

[TABLE]

When $K=1$ , (105) becomes

[TABLE]

where $\mathbf{x}^{*}$ and $\widehat{\mathbf{y}}$ denote the single columns of the matrices $\mathbf{X}^{*}$ and $\widehat{\mathbf{Y}}$, respectively.

Combining with (86), we have

[TABLE]

where in $(a)$ we have used the triangle inequality and $\delta>0$ is any positive number.

If there exists $\delta>0$ which ensures that $\mathbf{T}_{1}\succ 0$ , then there exist some $\gamma,\epsilon>0$ such that the following is true

[TABLE]

In the above inequality, the constant $\gamma$ is given by

[TABLE]

where $\lambda_{\min}(\mathbf{T}_{1})$ denotes the smallest eigenvalue of $\mathbf{T}_{1}$ . Clearly $\gamma$ can be made positive by setting $\epsilon$ sufficiently small.

According to the definition of the Lagrangian, we have

[TABLE]

Therefore, combining with (112) and the KKT conditions, we can obtain

[TABLE]

∎

Bibliography


[1] S. L. Campbell and G. D. Poole, “Computing nonnegative rank factorizations,” Linear Algebra and its Applications, vol. 35, pp. 175–182, Feb. 1981.
[2] P. Paatero and U. Tapper, “Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, June 1994.
[3] N. Gillis and S. A. Vavasis, “Fast and robust recursive algorithms for separable nonnegative matrix factorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 698–714, Apr. 2014.
[4] Y.-X. Wang and Y.-J. Zhang, “Nonnegative matrix factorization: A comprehensive review,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1336–1353, June 2013.
[5] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
[6] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proc. of Neural Information Processing Systems (NIPS), pp. 556–562, 2001.
[7] B. Yang, X. Fu, and N. D. Sidiropoulos, “Joint factor analysis and latent clustering,” in Proc. of IEEE Int. Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 173–176, Dec. 2015.
[8] N. Gillis, “The why and how of nonnegative matrix factorization,” in Regularization, Optimization, Kernels, and Support Vector Machines. Chapman & Hall/CRC, Machine Learning and Pattern Recognition Series, 2014.
