Clustering by Orthogonal NMF Model and Non-Convex Penalty Optimization

Shuai Wang; Tsung-Hui Chang; Ying Cui; and Jong-Shi Pang

arXiv:1906.00570·cs.LG·July 29, 2021


TL;DR

This paper introduces a novel non-convex penalty approach for orthogonal NMF clustering that transforms orthogonality constraints into norm-based penalties, enabling scalable and efficient solutions that outperform existing methods.

Contribution

The paper proposes a new formulation of orthogonal NMF using non-convex penalties, along with efficient algorithms and theoretical conditions for feasible solutions.

Findings

01 The proposed NCP methods are computationally efficient.

02 They match or outperform existing clustering methods.

03 Experimental results validate the effectiveness of the approach.

Tables

Table I: Average clustering performance (%) and CPU time (s) on the synthetic data for different values of SNR.

SNR (dB)		-5	-3	-1	1	3
ACC	KM	63.4	69.7	74.7	74.3	75.6
	KM++	64.1	70.1	70.3	70.8	69.2
	DTPP	81.6	85.1	85.9	87.0	86.6
	ONP-MF	66.8	88.3	83.1	89.9	90.0
	ONMF-S	77.4	79.0	80.2	81.3	81.7
	HALS	76.3	86.0	88.1	89.6	89.4
	SNCP	91.5	91.9	92.0	92.5	92.8
	NSNCP	90.1	90.7	91.8	92.0	92.2
Time	KM	3.10	1.96	1.52	1.16	1.11
	KM++	3.09	2.44	2.05	1.70	1.59
	DTPP	30.4	34.0	38.6	31.1	35.1
	ONP-MF	1097	1123	1124	1153	1148
	ONMF-S	68.4	97.1	116	116	90.5
	HALS	22.6	25.6	23.3	15.7	19.3
	SNCP	20.6	15.6	13.4	13.3	12.4
	NSNCP	14.8	15.5	11.5	10.5	10.6

Table II: Average iteration number on the synthetic datasets for different values of SNR.

SNR (dB)		-5	-3	-1	1	3
#Iteration	DTPP	2000	2000	2000	2000	2000
	ONMF-S	2000	2000	2000	2000	2000
	HALS	1931	1788	1521	1442	1325
	SNCP	1921	1601	1401	1310	1260
	NSNCP	1478	1322	1130	1063	1047

Table III: Average clustering performance (%) and CPU time (s) on the TCGA data.

Dataset		1	2	3	4	5
#samples $N$		1667	3086	3660	5314	11135
#cancers $K$		5	10	15	20	33
ACC	KM	75.0	67.0	57.0	52.2	34.4
	KM++	75.5	55.3	53.7	48.4	34.5
	DTPP	79.2	58.5	58.0	57.0	43.1
	ONMF-S	85.8	71.8	61.9	58.2	38.4
	ONP-MF	84.0	68.5	48.1	32.9	14.8
	HALS	86.2	68.8	56.1	56.3	39.1
	SNCP	85.6	79.3	64.0	61.1	41.1
	NSNCP	89.2	81.2	68.2	64.3	42.7
Time	KM	3.02	7.54	12.3	34.7	115
	KM++	3.08	7.83	26.9	40.3	331
	DTPP	249	434	1357	2135	2454
	ONMF-S	89.9	772	1440	2492	19794
	ONP-MF	4886	10414	14232	16092	77222
	HALS	2.26	11.4	98.1	260	1605
	SNCP	41.6	249	307	486	1118
	NSNCP	22.1	153	240	526	1756

Table IV: Average clustering performance (%) and CPU time (s) on the TDT2 data.

Dataset		1	2	3	4	5	6
#terms $M$		13133	24968	11079	20431	16067	29725
#docs $N$		842	3292	631	1745	1079	4779
ACC	KM	77.9	51.7	84.4	49.3	61.3	70.2
	KM++	73.5	48.9	84.7	47.1	61.7	65.7
	DTPP	70.0	50.4	77.4	52.1	45.1	66.6
	ONMF-S	81.8	52.0	83.6	59.6	50.9	68.8
	ONP-MF	85.3	46.1	89.5	60.9	59.4	66.8
	HALS	77.9	47.8	85.1	54.3	48.7	64.4
	SNCP	86.1	56.8	88.1	62.0	58.2	70.7
	NSNCP	79.7	54.1	88.6	60.8	56.4	64.8
Time	KM	13.6	195	6.82	65.0	27.4	318
	KM++	10.6	199	8.15	53.6	28.9	401
	DTPP	126	726	81	413	204	2282
	ONMF-S	405	8124	157	1915	633	17603
	ONP-MF	1200	7854	757	3660	1951	14302
	HALS	12.9	35.6	6.07	27.4	20.0	162
	SNCP	37.3	407	20.4	64.0	41.1	342
	NSNCP	24.7	335	16.3	50.0	37.6	424

Table V: Average clustering performance (%) and CPU time (s) on the dimension-reduced TDT2 data.

Dataset		1	2	3	4	5	6
Time of SC		13.8	402	7.11	90.7	30.7	1114
ACC	KM	81.6	69.0	83.1	86.3	89.1	80.2
	KM++	98.3	83.3	98.9	99.6	98.9	90.3
	HALS	95.8	82.8	97.8	98.8	99.4	90.8
	ONP-MF	98.6	82.3	99.0	99.3	99.5	89.7
	SNCP	99.0	84.3	98.0	99.2	99.3	92.7
Time	KM	0.24	1.34	0.17	0.45	0.33	2.33
	KM++	0.25	1.96	0.18	0.46	0.34	2.18
	HALS	0.78	2.45	0.81	1.65	0.75	1.52
	ONP-MF	13.6	385	7.29	90.0	26.0	790
	SNCP	0.59	2.00	0.55	1.31	1.21	3.17

Equations

${\bf H}{\bf H}^{\top}={\bm{I}}_{K}.$

$\|{\bf x}\|_{p}=\|{\bf x}\|_{q},\quad 1\leq p<q.$

$\left\{{\bf H}\geq{\bm{0}}~\bigg|~\begin{array}{l}\|\widetilde{\bf h}_{i}\|_{2}=1,~i\in{\mathcal{K}},\\ \|{\bf h}_{j}\|_{p}=\|{\bf h}_{j}\|_{q},~j\in{\mathcal{N}}\end{array}\right\}.$

$\min_{{\bf W},{\bf H}}~\|{\bf X}-{\bf W}{\bf H}\|_{F}^{2}+\frac{\mu_{w}}{2}\|{\bf W}\|_{F}^{2}+\frac{\mu_{h}}{2}\|{\bf H}\|_{F}^{2}$

${\rm s.t.}~{\bf W}\geq{\bm{0}},$

$\left.\begin{array}{l}{\bf H}\geq{\bm{0}},\\ \|{\bf h}_{j}\|_{p}^{v}=\|{\bf h}_{j}\|_{q}^{v},~j\in{\mathcal{N}},\end{array}\right\}\triangleq\mathcal{H}_{p,q}^{v}.$

$f^{\prime}({\bf x};{\bm{d}})=\lim_{\tau\searrow 0}\frac{f({\bf x}+\tau{\bm{d}})-f({\bf x})}{\tau}$

$p\geq 1,\quad q>p,\quad v\geq q,$

$({\bf x}^{\star})_{i}=\begin{cases}(y_{i}+c)^{+},&\text{if }i=i^{\star},\\ (y_{i})^{+},&\text{otherwise},\end{cases}$

$[{\bf Z}^{k+1}]_{ij}=\begin{cases}1,&\text{if }i=i_{j}^{\star},\\ 0,&\text{otherwise},\end{cases}\quad\forall i=1,\ldots,M,~j\in{\mathcal{N}},$

$\frac{\sum_{i,j}\binom{n_{ij}}{2}-\Big[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\Big]\Big/\binom{n}{2}}{\frac{1}{2}\Big[\sum_{i}\binom{n_{i\cdot}}{2}+\sum_{j}\binom{n_{\cdot j}}{2}\Big]-\Big[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\Big]\Big/\binom{n}{2}},$

$\text{(Orthogonality)}\quad\epsilon_{\rm orth}=\frac{\|{\bf Q}^{(r)}{\bf H}^{(r)}({\bf Q}^{(r)}{\bf H}^{(r)})^{\top}-{\bm{I}}_{K}\|_{F}}{K^{2}},$

$\text{(Normalized Residual)}\quad\epsilon_{\rm NR}=\frac{\|{\bf W}^{(r)}-{\bf W}^{(r-1)}\|_{F}}{\|{\bf W}^{(r-1)}\|_{F}}+\frac{\|{\bf H}^{(r)}-{\bf H}^{(r-1)}\|_{F}}{\|{\bf H}^{(r-1)}\|_{F}}.$

$\langle 2({\bf W})^{T}({\bf W}{\bf H}-{\bf X})+\nu{\bf H},{\bf D}\rangle\geq 0,\quad\forall{\bf D}\in{\mathcal{T}}_{\mathcal{H}_{p,q}^{v}}({\bf H}).$

$\sum_{j\neq j^{\prime}}\langle 2({\bf W})^{T}({\bf W}{\bf h}_{j}-{\bf x}_{j})+\nu{\bf h}_{j},{\bf d}_{j}\rangle-\langle 2({\bf W})^{T}{\bf x}_{j^{\prime}},{\bf d}_{j^{\prime}}\rangle\geq 0,\quad\forall{\bf D}\in{\mathcal{T}}_{\mathcal{H}_{p,q}^{v}}({\bf H}).$

$0>-2\langle({\bf W})^{T}{\bf x}_{j^{\prime}},\alpha{\bm{e}}_{\ell}\rangle=-2\alpha(({\bf W})^{T}{\bf x}_{j^{\prime}})_{\ell}\geq 0,$

$G_{\rho}({\bf W}^{\star},{\bf H}^{\star})\leq G_{\rho}({\bf W}^{\star},{\bf H}),\quad\forall{\bf H}\in{\mathcal{N}}_{\epsilon}({\bf H}^{\star}).$

Code & Models

Repositories

wshuai317/NCP_ONMF (Official)

Full text

Clustering by Orthogonal NMF Model and Non-Convex Penalty Optimization

Shuai Wang, Tsung-Hui Chang, Ying Cui, and Jong-Shi Pang

Shuai Wang and Tsung-Hui Chang are with the Shenzhen Research Institute of Big Data and School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China. Ying Cui is with the Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, MN 55455, USA. Jong-Shi Pang is with the Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, CA 90089, USA.

Abstract

The non-negative matrix factorization (NMF) model with an additional orthogonality constraint on one of the factor matrices, called the orthogonal NMF (ONMF), has been found a promising clustering model and can outperform the classical K-means. However, solving the ONMF model is a challenging optimization problem because the coupling of the orthogonality and non-negativity constraints introduces a mixed combinatorial aspect into the problem due to the determination of the correct status of the variables (positive or zero). Most of the existing methods directly deal with the orthogonality constraint in its original form via various optimization techniques, but are not scalable for large-scale problems. In this paper, we propose a new ONMF based clustering formulation that equivalently transforms the orthogonality constraint into a set of norm-based non-convex equality constraints. We then apply a non-convex penalty (NCP) approach to add them to the objective as penalty terms, leading to a problem that is efficiently solvable. One smooth penalty formulation and one non-smooth penalty formulation are respectively studied. We build theoretical conditions for the penalized problems to provide feasible stationary solutions to the ONMF based clustering problem, as well as proposing efficient algorithms for solving the penalized problems of the two NCP methods. Experimental results based on both synthetic and real datasets are presented to show that the proposed NCP methods are computationally time efficient, and either match or outperform the existing K-means and ONMF based methods in terms of the clustering performance.

Keywords: Data clustering, orthogonal non-negative matrix factorization, penalty method.

I Introduction

Clustering is one of the most fundamental data mining tasks and has an enormous number of applications [1]. Typically, clustering is a key intermediate step to explore the underlying structure of massive data for subsequent analysis. For example, in internet applications, clustering is used to identify users with different interests, who can then be provided with tailored service recommendations [2, 3]. In biology, clustering can be used for pattern discovery of genes, which can help identify subtypes of a certain disease or cancer [4, 5, 6].

Among the existing clustering methods, the K-means [7] is the most widely used one, thanks to its simplicity [8]. However, the K-means may not always yield satisfactory clustering results. On one hand, from an optimization perspective, the iterative steps of finding the cluster centroids and cluster assignment in K-means are equivalent to solving a binary integer constrained matrix factorization problem by alternating optimization [9, 10]. Due to the non-convex matrix factorization model and binary integer constraint, the iterates of K-means are likely to get stuck at an unsatisfactory local solution, and are sensitive to the choice of initial points [11]. On the other hand, the K-means overlooks the inherent low-rank structure and prior information that high-dimensional real data usually possess. Therefore, various dimension-reduction techniques such as principal component analysis (PCA), spectral clustering [12], non-negative matrix factorization (NMF) [13, 14] and deep neural networks [15, 16] have been proposed. However, these methods are merely used as a preprocessing stage to find a clustering-friendly representation for the data, and the K-means is still often used for clustering the dimension-reduced data. Thus, the intrinsic drawback of the K-means caused by the non-convex nature and discrete constraints is not addressed.

Recently, as a variant of NMF, the orthogonal NMF (ONMF) model has been considered for data clustering [17, 18, 19, 20, 21, 22, 6]. The ONMF model imposes an additional orthogonality constraint on one of the factor matrices in NMF. It turns out that, as in the K-means, the orthogonally constrained factor matrix functions as an indicator matrix that shows how the data samples are assigned to different clusters [17, 21]. Therefore, the ONMF model can be regarded as a continuous relaxation (which has no discrete constraint) of the K-means. Studies on various data mining tasks have found that the ONMF model can outperform the K-means and NMF based clustering methods [23, 24, 18, 20, 21, 22, 6].

I-A Related Works

Despite its widespread use, solving the ONMF problem is challenging due to the coexistence of the orthogonality and non-negativity constraints. Many of the existing ONMF algorithms extend the classical multiplicative update (MU) rule by Lee and Seung [13] for the vanilla NMF to accommodate the additional orthogonality constraint. For example, reference [17] penalized the orthogonality constraint and then applied the MU rule. The authors of [19] derived the MU rule directly using the gradient vector over the Stiefel manifold. Reference [21] employed the augmented Lagrangian (AL) method that penalizes the non-negativity constraint and applies the gradient projection method for the orthogonally constrained subproblem. Since projection onto the set of orthonormal matrices involves singular value decomposition (SVD), the orthogonal nonnegatively penalized matrix factorization (ONP-MF) method in [21] can be computationally inefficient. The hierarchical alternating least squares (HALS) method proposed in [20] is claimed to achieve a better balance between the orthogonality and non-negativity constraints. The HALS method updates one column and one row of the two respective factor matrices at the same time in each iteration, subject to the orthogonality and non-negativity constraints. Since the subproblem in HALS is handled by the same MU rule in [17], it is computationally efficient in general. Reference [22] proposed to approximate the ONMF solution by solving a low-rank non-negative PCA problem, which, however, involves generating a large number of candidate solutions. While the method in [22] is the first that can provide a provable approximation guarantee for the ONMF problem, it can be computationally inefficient, especially when a high-quality solution is sought.

On the other hand, it is noticed that some recent Riemannian optimization methods on the Stiefel manifold [25, 26] can handle problems with orthogonality constraints together with non-smooth, Lipschitz-continuous regularizations. However, since the indicator function induced by the non-negativity constraint is not Lipschitz continuous, these methods are not applicable to the ONMF problem.

I-B Contributions

In this paper, we propose a new approach to handle the ONMF based data clustering problem, aiming at achieving high-quality clustering performance with reasonable computational cost. Specifically, recognizing that the coupling of the orthogonality and non-negativity constraints introduces disjunctions into the problem and that projection onto the orthogonal set requires expensive SVD, we avoid directly dealing with the orthogonality constraint as done in the existing methods [17, 20, 19, 21]. Instead, we propose in [27] a novel problem reformulation for the ONMF problem that replaces the orthogonality constraint by a set of norm-based (non-convex) equality constraints. The second ingredient of the proposed approach is the use of the penalty method in optimization [28] to add these non-convex norm-equality constraints as penalty terms in the objective function while keeping the non-negativity in the constraints. The advantages of the proposed method are twofold. First, the penalty method allows a solution that satisfies the norm-equality constraints to be found gradually, and is therefore less likely to get stuck at bad local solutions. Second, since only simple non-negativity constraints are left, the penalized problem can be efficiently handled by the existing proximal alternating linearized minimization (PALM) method [29] without involving any SVD. As will be shown shortly, the proposed algorithms are inherently parallel and are more computationally efficient than most of the existing methods in practice.

In particular, we consider two novel types of non-convex penalty (NCP) formulations: a smooth one that has the squared $\ell_{1}$-norm minus the squared $\ell_{2}$-norm as the penalty term, and a non-smooth one that has the $\ell_{1}$-norm minus the $\ell_{\infty}$-norm as the penalty term. The two penalties lead to different theoretical properties and algorithm developments. We obtain theoretical conditions under which the two penalty methods can yield a feasible and meaningful solution to the considered ONMF based clustering problem. We also develop computationally efficient PALM algorithms to solve the penalized problems for the smooth and non-smooth NCP methods, respectively. It is worth mentioning that the penalized problem of the non-smooth NCP method is a non-convex and non-smooth problem with a negative infinity-norm regularization. We propose a novel non-convex proximal operator of the negative infinity-norm function, and show that it has a simple closed-form solution.

Extensive experiments are conducted based on a synthetic dataset [10], the gene dataset TCGA [30, 31] and the document dataset TDT2 [32]. The experimental results demonstrate that the proposed smooth and non-smooth NCP methods provide either comparable or substantially better performance than the existing K-means based methods and ONMF based methods in [17, 21, 19, 20], while being more time efficient than most of the existing ONMF based methods.

Synopsis: Section II reviews the existing K-means and ONMF model, and presents the proposed clustering problem formulation. The proposed NCP optimization framework is presented in Section III, where theoretical conditions for the smooth and non-smooth NCP methods to yield feasible stationary solutions are analyzed. The PALM algorithms used for solving the smooth and non-smooth penalized problems are given in Section IV. Experimental results are presented in Section V, and lastly the conclusions are drawn in Section VI.

Notation: We use boldface lowercase letters and boldface uppercase letters to represent column vectors and matrices, respectively. $\mathbb{R}^{m\times n}$ denotes the set of $m$ by $n$ real-valued matrices. The $(i,j)$th entry of matrix ${\bf A}$ is denoted by $[{\bf A}]_{ij}$; the $i$th element of vector ${\bf a}$ is denoted by $({\bf a})_{i}$ or $a_{i}$. Superscript $\top$ stands for matrix transpose. For a matrix ${\bf A}\in\mathbb{R}^{m\times n}$, its column vectors are denoted by ${\bf a}_{j}\in\mathbb{R}^{m}$, $j=1,\ldots,n$, and its row vectors are denoted by $\widetilde{\bf a}_{i}\in\mathbb{R}^{n}$, $i=1,\ldots,m$; that is, ${\bf A}=\begin{bmatrix}{\bf a}_{1},\ldots,{\bf a}_{n}\end{bmatrix}=\begin{bmatrix}\widetilde{\bf a}_{1},\ldots,\widetilde{\bf a}_{m}\end{bmatrix}^{\top}$. We denote ${\bf A}\geq{\bm{0}}$ as a non-negative matrix, i.e., $[{\bf A}]_{ij}\geq 0$ for all $i=1,\ldots,m$, $j=1,\ldots,n$. ${\bf 1}$ denotes the all-one vector, ${\bm{0}}$ is the all-zero vector, ${\bm{I}}_{m}$ is the $m$ by $m$ identity matrix, and ${\bm{e}}_{n}$ denotes the elementary vector with one in the $n$th entry and zeros elsewhere. $\|\cdot\|_{F}$ and $\|\cdot\|_{p}$ are the matrix Frobenius norm and vector $p$-norm, respectively. $\langle{\bf A},{\bm{B}}\rangle$ denotes the inner product between matrices ${\bf A}$ and ${\bm{B}}$. $\lambda_{\max}({\bf A})$ stands for the maximum eigenvalue of matrix ${\bf A}$. For a convex function $f\colon\mathbb{R}^{n}\rightarrow\mathbb{R}$, $\partial f({\bf x})$ denotes the subdifferential of $f$ at ${\bf x}$ as in standard convex analysis [33]. Lastly, $[{\bf A}]^{+}$ denotes $\max\{{\bf A},{\bm{0}}\}$, which is a matrix that retains the non-negative elements of ${\bf A}$ and sets the others to zero.

II Data Clustering and ONMF Model

II-A K-Means and ONMF Model

Let ${\bf X}\geq{\bm{0}}$ be a non-negative data matrix that contains $N$ data samples, each of which has $M$ features, i.e., ${\bf X}=[{\bf x}_{1},\ldots,{\bf x}_{N}]\in\mathbb{R}^{M\times N}$. The task of data clustering is to assign the $N$ data samples to a predefined number $K$ of clusters such that the samples belonging to one cluster are close to each other under a certain distance metric. The most popular setting is to use the Euclidean distance together with the K-means, owing to its simplicity.

From an optimization point of view [9, 10], the iterative procedure of K-means can be interpreted as an alternating optimization algorithm applied to the following matrix factorization problem

[TABLE]

where ${\mathcal{K}}\triangleq\{1,\ldots,K\}$ and ${\mathcal{N}}\triangleq\{1,\ldots,N\}$ . Here, columns of ${\bf W}\in\mathbb{R}^{M\times K}$ represent centroids of the $K$ clusters, while the matrix ${\bf H}\in\mathbb{R}^{K\times N}$ indicates the cluster assignment of samples. Specifically, $[{\bf H}]_{ij}=1$ indicates that the $j$ th sample ${\bf x}_{j}$ is uniquely assigned to cluster $i$ , and $[{\bf H}]_{ij}=0$ otherwise.

One can see from (1) that, when ${\bf W}=[{\bm{w}}_{1},\ldots,{\bm{w}}_{K}]$ is given, the optimal ${\bf H}$ is obtained by assigning each sample to the cluster that has the nearest centroid, while when ${\bf H}$ is given, the optimal ${\bm{w}}_{i}$ is given by the centroid (the average of the samples assigned to the $i$th cluster), for all $i\in{\mathcal{K}}$. These two steps are exactly the well-known K-means algorithm. However, due to the non-convex binary constraint (1b) and the hard clustering assignment during the iterative steps, the K-means is sensitive to the initial conditions and may not always yield satisfactory clustering performance [11].
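The two alternating steps described above can be sketched in a few lines of Python. This is only an illustration of the alternating-optimization view of K-means, not code from the paper; the function name `kmeans_alternating`, the random initialization and the fixed iteration count are assumptions made for the example.

```python
import numpy as np

def kmeans_alternating(X, K, n_iter=50, seed=0):
    """Alternate between the H-step (assignment) and the W-step (centroids).

    X is M x N (one sample per column); returns W (M x K) and a binary H (K x N).
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    # Illustrative initialization: pick K distinct samples as centroids.
    W = X[:, rng.choice(N, size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # H-step: squared distance between every centroid and every sample.
        d2 = ((W[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # K x N
        labels = d2.argmin(axis=0)
        H = np.zeros((K, N))
        H[labels, np.arange(N)] = 1.0
        # W-step: each centroid becomes the mean of its assigned samples.
        for i in range(K):
            members = labels == i
            if members.any():  # keep the old centroid if a cluster is empty
                W[:, i] = X[:, members].mean(axis=1)
    return W, H
```

Each column of the returned `H` contains a single one, which is the hard assignment that makes the iterates sensitive to the initial centroids.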

Regarded as a relaxation of (1), the following vanilla NMF model is also considered for data clustering [14, 21, 10]

[TABLE]

However, since (1b) is completely removed from (1), the obtained ${\bf H}$ from the above vanilla NMF model cannot reveal clear clustering assignment.

It was shown in [17, 21] that the ONMF model, which has an additional orthogonality constraint on ${\bf H}$ , is closely related to the K-means model (1). In particular, the ONMF problem is given by

[TABLE]

A key observation is given as below.

Observation: Any ${\bf H}$ satisfying ${\bf H}\geq{\bm{0}}$ and ${\bf H}{\bf H}^{\top}={\bm{I}}_{K}$ has at most one non-zero entry in each column.
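The Observation follows from a short argument, written out here as a worked step; it uses only the two constraints stated above.

```latex
% For i \neq k, orthogonality of the rows of H gives
[\mathbf{H}\mathbf{H}^{\top}]_{ik}
  = \sum_{j=1}^{N} [\mathbf{H}]_{ij}\,[\mathbf{H}]_{kj} = 0 .
% Every summand is non-negative because H >= 0, so each one must vanish:
[\mathbf{H}]_{ij}\,[\mathbf{H}]_{kj} = 0
  \quad \forall\, j \in \mathcal{N},\; i \neq k .
% Hence no column j can contain two non-zero entries.
```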

Thus, matrix ${\bf H}$ in (3) functions similarly to that in (1): its non-zero elements indicate the cluster assignment of the data samples. Nevertheless, different from (1), the entries of ${\bf H}$ in (3) are not restricted to be either zero or one; the non-zero entries can be scaled. Owing to these two facts, the ONMF model is less sensitive to data scaling and is preferred over the K-means, especially for data where the clustering results are independent of data scaling; see [21] for more discussions.

However, the ONMF problem (3) is intrinsically challenging to solve since the above Observation indicates that the intersection of ${\bf H}\geq{\bm{0}}$ and ${\bf H}{\bf H}^{\top}={\bm{I}}_{K}$ is mixed combinatorial due to the determination of the correct status of the variables (positive or zero). In view of this, the philosophy of the existing methods is to handle the two constraints separately either by the penalty method [17, 20, 19] or by the AL method [21] which requires repetitive projections onto the orthogonal set via expensive SVD. Unlike the existing methods, we present a novel problem reformulation of (3) which avoids dealing with the orthogonality constraint directly. Based on the reformulated problem, we propose a non-convex penalty (NCP) optimization framework that is not only amenable to efficient computation but also able to provide favorable clustering performance.

II-B Proposed Clustering Formulation

We note that any vector ${\bf x}\in\mathbb{R}^{n}$ has at most one non-zero entry if and only if

$\|{\bf x}\|_{p}=\|{\bf x}\|_{q},\quad 1\leq p<q.$
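The norm-equality characterization can be checked numerically. The Python sketch below is illustrative only (the helper name `norm_gap` and the test vectors are assumptions); it evaluates $\|{\bf x}\|_{p}-\|{\bf x}\|_{q}$ for a vector with one non-zero entry and for a vector with several.

```python
import numpy as np

def norm_gap(x, p=1.0, q=2.0):
    """Return ||x||_p - ||x||_q, which is >= 0 whenever 1 <= p < q."""
    return np.linalg.norm(x, ord=p) - np.linalg.norm(x, ord=q)

one_hot_like = np.array([0.0, 3.5, 0.0, 0.0])  # a single non-zero entry
spread = np.array([1.0, 3.5, 0.0, 0.2])        # several non-zero entries

for q in (2.0, 4.0, np.inf):
    # The gap is (numerically) zero for the first vector and positive for the second.
    print(q, norm_gap(one_hot_like, 1.0, q), norm_gap(spread, 1.0, q))
```

The gap vanishes exactly when all norms collapse to the magnitude of the single non-zero entry, which is why it can serve as a measure of how far a column is from a hard assignment.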

According to the Observation, any ${\bf H}$ satisfying (3c) and (3b) also lies in the following set

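$\left\{{\bf H}\geq{\bm{0}}~\bigg|~\begin{array}{l}\|\widetilde{\bf h}_{i}\|_{2}=1,~i\in{\mathcal{K}},\\ \|{\bf h}_{j}\|_{p}=\|{\bf h}_{j}\|_{q},~j\in{\mathcal{N}}\end{array}\right\}.$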

Thus, the ONMF model (3) can be equivalently written as

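$\displaystyle\min_{{\bf W},{\bf H}}~\|{\bf X}-{\bf W}{\bf H}\|_{F}^{2}$

(8a)

$\displaystyle~~~{\rm s.t.}~{\bf W}\geq{\bm{0}},~{\bf H}\geq{\bm{0}},$

$\displaystyle~~~~~~~\|\widetilde{\bf h}_{i}\|_{2}=1,~i\in{\mathcal{K}},$

$\displaystyle~~~~~~~\|{\bf h}_{j}\|_{p}=\|{\bf h}_{j}\|_{q},~j\in{\mathcal{N}}.$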

Firstly, note that the condition $\|\widetilde{\bf h}_{i}\|_{2}=1,~i\in{\mathcal{K}}$, is not intrinsic to the data clustering task. In essence, both ${\bf H}$ and ${\bf Q}{\bf H}$, where ${\bf Q}\geq 0$ is an invertible diagonal matrix, indicate the same cluster assignment, and both $({\bf W},{\bf H})$ and $({\bf W}{\bf Q}^{-1},{\bf Q}{\bf H})$ have the same objective values in (8a). Therefore, without loss of the clustering performance, we remove $\|\widetilde{\bf h}_{i}\|_{2}=1,~i\in{\mathcal{K}}$, from (8). Secondly, to obtain bounded solutions, we add the regularization term $\frac{\mu_{w}}{2}\|{\bf W}\|_{F}^{2}+\frac{\mu_{h}}{2}\|{\bf H}\|_{F}^{2}$ to (8), where $\mu_{w},\mu_{h}\geq 0$ are two parameters (adding the regularization term makes the objective coercive, which in turn guarantees bounded solutions to the optimization problem). Thirdly, without loss of generality, we replace $\|{\bf h}_{j}\|_{p}=\|{\bf h}_{j}\|_{q}$ with $\|{\bf h}_{j}\|_{p}^{v}=\|{\bf h}_{j}\|_{q}^{v}$ by adding a power exponent $v>0$. As a result, we have the following problem formulation for data clustering:

Proposed clustering formulation:

$\displaystyle\min_{{\bf W},{\bf H}}~{}\|{\bf X}-{\bf W}{\bf H}\|_{F}^{2}+\frac{\mu_{w}}{2}\|{\bf W}\|_{F}^{2}+\frac{\mu_{h}}{2}\|{\bf H}\|_{F}^{2}$

(9a)

$\displaystyle~{}~{}~{}{\rm s.t.}~{}{\bf W}\geq{\bm{0}},$

(9b)

$\displaystyle~{}~{}~{}~{}\begin{array}[]{ll}&{\bf H}\geq{\bm{0}},\\ &\|{\bf h}_{j}\|_{p}^{v}=\|{\bf h}_{j}\|_{q}^{v},~{}j\in{\mathcal{N}},\end{array}\bigg{\}}\triangleq\mathcal{H}_{p,q}^{v}.$

(9e)
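The scaling argument used to drop the row-norm condition can be checked numerically. The Python sketch below is only an illustration: the sizes, the random seed and the diagonal of ${\bf Q}$ are arbitrary assumptions, and the check concerns the data-fitting term $\|{\bf X}-{\bf W}{\bf H}\|_{F}^{2}$ only, not the regularization terms in (9a).

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 6, 3, 10
X = rng.random((M, N))
W = rng.random((M, K))

labels = np.arange(N) % K                      # hypothetical cluster labels
H = np.zeros((K, N))
H[labels, np.arange(N)] = rng.random(N) + 0.1  # one positive entry per column

q_diag = np.array([0.5, 2.0, 3.0])             # positive diagonal of Q
W_scaled = W / q_diag                          # W Q^{-1}: rescales the columns of W
H_scaled = q_diag[:, None] * H                 # Q H: rescales the rows of H

fit = np.linalg.norm(X - W @ H) ** 2
fit_scaled = np.linalg.norm(X - W_scaled @ H_scaled) ** 2
same_support = np.array_equal(H > 0, H_scaled > 0)
print(np.isclose(fit, fit_scaled), same_support)
```

Both the fitting error and the positions of the non-zero entries (hence the cluster assignment) are unchanged by the rescaling, which is why the row normalization carries no clustering information.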

One clear advantage of formulation (9) is its scalability. Specifically, when ${\bf W}$ is fixed, problem (9) can be fully decoupled across the columns of ${\bf H}$ into $N$ subproblems. This makes it easy to apply decomposition methods to large-scale clustering tasks. Nevertheless, it is still challenging to solve (9). The main challenge lies in two intertwined issues. The first is how to deal with the non-convex objective function and the (possibly non-smooth) norm-equality constraint (9e) together with the non-negativity constraint (9b). The second is how to choose proper values of $p$, $q$ and $v$, since different choices lead to different theoretical and computational properties.
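The column-wise decoupling can be made concrete with a small Python sketch. It is an illustration, not the algorithm proposed in the paper: for a fixed ${\bf W}$, each column is restricted to be non-negative with at most one non-zero entry, and the per-column cost $\|{\bf x}_{j}-{\bf W}{\bf h}_{j}\|_{2}^{2}+\frac{\mu_{h}}{2}\|{\bf h}_{j}\|_{2}^{2}$ is minimized in closed form. The helper names, sizes and parameter values are assumptions.

```python
import numpy as np

def column_cost(W, x, h, mu_h):
    """Contribution of a single column h_j to the objective (9a)."""
    return np.sum((x - W @ h) ** 2) + 0.5 * mu_h * np.sum(h ** 2)

def best_single_entry_column(W, x, mu_h):
    """Minimize column_cost over h >= 0 with at most one non-zero entry."""
    corr = np.maximum(W.T @ x, 0.0)              # (w_i^T x)^+ for each cluster i
    denom = np.sum(W ** 2, axis=0) + 0.5 * mu_h  # ||w_i||^2 + mu_h / 2
    gain = corr ** 2 / denom                     # cost reduction if entry i is used
    i_star = int(np.argmax(gain))
    h = np.zeros(W.shape[1])
    h[i_star] = corr[i_star] / denom[i_star]
    return h

rng = np.random.default_rng(0)
M, K, N, mu_w, mu_h = 5, 3, 8, 0.1, 0.1
X, W = rng.random((M, N)), rng.random((M, K))

# Each column is handled independently of the others (trivially parallel).
H = np.column_stack([best_single_entry_column(W, X[:, j], mu_h) for j in range(N)])

total = (np.linalg.norm(X - W @ H) ** 2
         + 0.5 * mu_w * np.linalg.norm(W) ** 2
         + 0.5 * mu_h * np.linalg.norm(H) ** 2)
per_column = sum(column_cost(W, X[:, j], H[:, j], mu_h) for j in range(N))
print(np.isclose(total, per_column + 0.5 * mu_w * np.linalg.norm(W) ** 2))
```

The final check confirms that the objective (9a) is a sum of per-column costs plus a term that depends on ${\bf W}$ alone, so the $N$ subproblems can be distributed across workers.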

Before proceeding with the algorithmic development, we first present basic definitions for characterizing a proper solution of the non-convex and possibly non-smooth problem (9).

Definition 1

(Tangent cone) Let ${\mathcal{X}}\subseteq\mathbb{R}^{n}$ and ${\bf x}\in{\mathcal{X}}$ . A vector ${\bm{d}}$ is a tangent of ${\mathcal{X}}$ at ${\bf x}$ if either ${\bm{d}}={\bm{0}}$ or there exists a sequence $\{{\bf x}^{k}\}\subset{\mathcal{X}}$ and positive scalars $\{\tau^{k}\}$ such that $\lim_{k\to\infty}\frac{{\bf x}^{k}-{\bf x}}{\tau^{k}}={\bm{d}}$ when ${\bf x}^{k}\to{\bf x},~{}\tau^{k}\searrow 0$ . The tangent cone of ${\mathcal{X}}$ at ${\bf x}$ , denoted by ${\mathcal{T}}_{{\mathcal{X}}}({\bf x})$ , contains all the tangents of ${\mathcal{X}}$ at ${\bf x}$ .

Definition 2

(Directional derivative) Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a possibly non-smooth function. Then $f$ is directionally differentiable if the directional derivative of $f$ along any direction ${\bm{d}}\in\mathbb{R}^{n}$

$f^{\prime}({\bf x};{\bm{d}})=\lim_{\tau\searrow 0}\frac{f({\bf x}+\tau{\bm{d}})-f({\bf x})}{\tau}$

exists at any ${\bf x}\in\mathbb{R}^{n}$ .

If $f$ is differentiable, then the directional derivative $f^{\prime}({\bf x};{\bm{d}})$ in the above definition reduces to $\nabla f({\bf x})^{\top}{\bm{d}}$, where $\nabla f({\bf x})$ is the gradient vector of $f$ at ${\bf x}$.
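To see why the one-sided limit matters for non-smooth functions, the following Python sketch estimates directional derivatives by a forward finite difference. It is illustrative only: the helper names and the step size `tau` are assumptions, and the test function is the $\ell_{1}$-norm minus the $\ell_{\infty}$-norm evaluated at a point where two entries tie for the maximum.

```python
import numpy as np

def directional_derivative(f, x, d, tau=1e-7):
    """One-sided finite-difference estimate of f'(x; d) with a small tau > 0."""
    return (f(x + tau * d) - f(x)) / tau

def penalty(h):
    # Non-smooth example: l1-norm minus l-infinity norm.
    return np.sum(np.abs(h)) - np.max(np.abs(h))

x = np.array([1.0, 1.0])  # a kink: both entries attain the maximum
d = np.array([1.0, 0.0])

print(directional_derivative(penalty, x, d))   # close to 0
print(directional_derivative(penalty, x, -d))  # close to -1
```

The two estimates are not negatives of each other, so no gradient vector can reproduce both; the function is directionally differentiable at this point but not differentiable.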

Definition 3

(B-stationary solution) [34] Consider an optimization problem $\min_{{\bf x}\in{\mathcal{X}}}f({\bf x})$, where $f:\mathbb{R}^{n}\to\mathbb{R}$ is directionally differentiable and ${\mathcal{X}}$ is a closed set. Then, $\overline{\bf x}\in{\mathcal{X}}$ is a B-stationary point if $f^{\prime}(\overline{\bf x};{\bm{d}})\geq 0,~\forall{\bm{d}}\in{\mathcal{T}}_{{\mathcal{X}}}(\overline{\bf x}).$

If ${\mathcal{X}}$ is a convex set, then the condition is equivalent to $f^{\prime}(\overline{\bf x};{\bf x}-\overline{\bf x})\geq 0,~\forall{\bf x}\in{\mathcal{X}},$ and $\overline{\bf x}$ is known as a d-stationary point. If $f$ is differentiable and ${\mathcal{X}}$ is convex, then the condition reduces to $\nabla f(\overline{\bf x})^{\top}({\bf x}-\overline{\bf x})\geq 0,~\forall{\bf x}\in{\mathcal{X}},$ and $\overline{\bf x}$ is simply called a stationary point [34, (1.3.3)].

As proved in Appendix A, the following proposition shows that any B-stationary point of (9) has non-zero columns for ${\bf H}$ .

Proposition 1

Let $(\overline{\bf W},\overline{\bf H})$ be a B-stationary point of (9) satisfying $(\overline{\bf W})^{\top}{\bf x}_{j}\neq{\bm{0}},~\forall j\in{\mathcal{N}}$. Then, $\overline{\bf h}_{j}\neq{\bm{0}},~\forall j\in{\mathcal{N}}$.

The condition $(\overline{\bf W})^{\top}{\bf x}_{j}\neq{\bm{0}},~\forall j\in{\mathcal{N}}$, is actually mild, as it is known that the centroids ${\bf W}$ should usually lie in the space spanned by the cluster samples [35]. Proposition 1 implies that a B-stationary point of (9) is meaningful for clustering since, at a B-stationary point, every data sample is assigned to one of the clusters.

III Proposed Non-Convex Penalty Method

In this section, we present the proposed NCP framework to handle problem (9). Let $\phi({\bf h}_{j})\triangleq\|{\bf h}_{j}\|_{p}^{v}-\|{\bf h}_{j}\|_{q}^{v}\geq 0$ for $1\leq p<q$ . Since $\phi({\bf h}_{j})=0$ is equivalent to making mixed binary decisions for each entry of ${\bf h}_{j}$ (being zero or non-zero), our intuition is to have an algorithm that can gently arrive at the mixed binary decision so that it can be less sensitive to bad initial points. Also, it is desirable to handle a problem with simple constraint sets. Motivated by this, we consider the following penalized formulation

[TABLE]

where $F({\bf W},{\bf H})\triangleq\|{\bf X}-{\bf W}{\bf H}\|_{F}^{2}+\frac{\mu_{w}}{2}\|{\bf W}\|_{F}^{2}+\frac{\mu_{h}}{2}\|{\bf H}\|_{F}^{2}$, and $\rho>0$ is a penalty parameter. As seen from (11), since the non-convex norm-equality constraints in (9e) are penalized in the objective function, the penalized problem (11) involves only a simple convex constraint set.

Like the classical penalty method [28, Chapter 17], we attempt to reach a good clustering solution of (9) by solving a sequence of penalized subproblems (11) while gradually increasing $\rho$ . The proposed NCP framework is shown in Algorithm 1. It is expected that the norm-equality constraint (9e) can be gradually satisfied as $\rho$ increases, implying that the clustering assignment is achieved step by step in a smooth manner. This is in contrast to the classical K-means, which makes a hard decision on the clustering assignment in every iteration. This property also makes the proposed NCP method less sensitive to the choice of the initial point, which is one of its key advantages. An illustrative example demonstrating this point is given in [36, Section 5] of the supplementary material.

We study two types of penalty functions for (11): a smooth penalty and a non-smooth penalty. The smooth penalty is inspired by the classical quadratic penalty method [28, Chapter 17.1], where we choose $p=1,q=2,v=2$ for (11). For the non-smooth penalty, we choose $p=1,q=\infty$ and $v=1$ . As will be shown in Section III-B, such a choice can provide the so-called exact penalty property [28] (footnote 2: we say that a penalized problem like (11) has an exact penalty if there exists a finite value of $\rho^{*}$ such that for any $\rho>\rho^{*}$ the solution of (11) is also a (feasible) solution of (9)) while only requiring relaxed conditions on the solution of (11). To justify the effectiveness of the two penalties, in the ensuing two subsections, we present theoretical conditions under which the penalty method and Algorithm 1 can yield a feasible stationary solution to problem (9). In Section IV, efficient algorithms for solving (11) are presented.
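To make the two penalty choices concrete, the following minimal Python sketch evaluates $\phi({\bf h}_{j})$ for the smooth case ( $p=1,q=2,v=2$ ) and the non-smooth case ( $p=1,q=\infty,v=1$ ). The function names are illustrative and are not taken from the authors' code. For a non-negative vector, both penalties vanish exactly when at most one entry is non-zero, which corresponds to the mixed binary decision described above.

```python
def phi_smooth(h):
    """Smooth penalty ||h||_1^2 - ||h||_2^2 (p=1, q=2, v=2)."""
    l1 = sum(abs(x) for x in h)
    l2_squared = sum(x * x for x in h)
    return l1 * l1 - l2_squared


def phi_nonsmooth(h):
    """Non-smooth penalty ||h||_1 - ||h||_inf (p=1, q=inf, v=1)."""
    return sum(abs(x) for x in h) - max(abs(x) for x in h)


# A column with a single non-zero entry is feasible (penalty 0),
# while a column that splits its mass over two clusters is penalized.
print(phi_smooth([0.0, 3.0, 0.0]), phi_nonsmooth([0.0, 3.0, 0.0]))
print(phi_smooth([1.0, 1.0]), phi_nonsmooth([1.0, 1.0]))
```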

III-A Smooth NCP (SNCP) Method

The proposed SNCP is obtained by choosing $p=1,q=2$ and $v=2$ in (9) and (11). Since ${\bf H}\geq{\bm{0}}$ , $\|{\bf h}_{j}\|_{1}={\bf 1}^{\top}{\bf h}_{j},j\in{\mathcal{N}}$ , the corresponding penalized problem (11) is given by

[TABLE]

which has $G_{\rho}({\bf W},{\bf H})\triangleq F({\bf W},{\bf H})+\frac{\rho}{2}\sum_{j=1}^{N}\big{(}({\bf 1}^{\top}{\bf h}_{j})^{2}-\|{\bf h}_{j}\|_{2}^{2}\big{)}$ as the smooth objective function.

The following theorem provides the theoretical justification for the SNCP method.

Theorem 1

There exists a finite $\rho^{\star}>0$ such that for any $\rho>\rho^{\star}$ , any local minimum solution of (12) is a feasible and a local minimum solution to (9) (with $p=1$ , $q=2$ and $v=2$ ).

The proof of Theorem 1 is given in Appendix B. Theorem 1 shows that the SNCP has the exact penalty as long as a local minimum of (12) can be reached. Unfortunately, this cannot be guaranteed in general. The following theorem asserts a relaxed condition that if only a stationary solution (not local minimum) is obtained, (12) can still yield a feasible B-stationary solution of (9) as long as $\rho$ goes to infinity.

Theorem 2

Let $({\bf W}^{\rho},{\bf H}^{\rho})$ denote a stationary point of (12), and assume that $({\bf W}^{\rho},{\bf H}^{\rho})$ is bounded and it has a limit point $({\bf W}^{\infty},{\bf H}^{\infty})\neq{\bm{0}}$ when $\rho\to\infty$ . Then, $({\bf W}^{\infty},{\bf H}^{\infty})$ is a feasible B-stationary point to (9).

In practice, a finite value of $\rho$ would be sufficient to obtain a reasonably good solution, as one will see in Section V. It is worthwhile to note that for the general quadratic penalty method [28, Chapter 17.1] to claim the same result as in Theorem 2, one requires conditions such as that $({\bf W}^{\infty},{\bf H}^{\infty})$ satisfies a certain constraint qualification. Our Theorem 2 does not have such a requirement thanks to the special constraint structure of (9); see the proof of Theorem 2 in Appendix C.

III-B Non-smooth NCP (NSNCP) Method

In this subsection, we consider a non-smooth NCP method by setting $p=1,q=\infty$ and $v=1$ in (9) and (11). The corresponding penalized problem (11) is given by

[TABLE]

where the objective function $F_{\rho}({\bf W},{\bf H})\triangleq F({\bf W},{\bf H})+{\rho}\sum_{j=1}^{N}\big{(}{\bf 1}^{\top}{\bf h}_{j}-\|{\bf h}_{j}\|_{\infty}\big{)}$ is non-smooth. Interestingly, analogous to the exact penalty in Theorem 1 but without the need of a local minimum solution, the NSNCP method allows one to obtain a feasible B-stationary solution of (9) if a d-stationary solution of (13) can be obtained.

Theorem 3

Assume that $({\bf W}^{\rho},{\bf H}^{\rho})$ is bounded. Then, there exists a finite $\rho^{\star}>0$ such that for all $\rho>\rho^{\star}$ , if $({\bf W}^{\rho},{\bf H}^{\rho})$ is a d-stationary point of (13), then $({\bf W}^{\rho},{\bf H}^{\rho})$ is feasible and a B-stationary point to (9).

The proof of Theorem 3 is presented in Appendix D. By comparing Theorem 3 with Theorem 1, one can see that the NSNCP method in (13) is theoretically preferred since it does not require the penalized problem (13) to provide a locally optimal solution (which computationally cannot be ascertained) but only a d-stationary point; by comparing Theorem 3 with Theorem 2, we see that the NSNCP requires a finite value of $\rho$ in the non-smooth formulation (13a) versus an infinite value of $\rho$ in the smooth formulation (12a). However, since (13) is non-convex and non-smooth, it is more challenging to handle than its smooth counterpart (12). In the next section, we present efficient algorithms that can be used to obtain a proper stationary solution of the penalized problems (12) and (13), respectively.

Remark 1

(On the choice of $p,q,$ and $v$ ) As mentioned, the smooth penalty in (12) with $p=1,q=2,v=2$ is inspired by the classical quadratic penalty method [28, Chapter 17.1]. A natural question is whether other choices are possible. In fact, one can verify that as long as $p,q,v$ satisfy

[TABLE]

then (11) will be a smooth problem. Moreover, one can have the same claim as Theorem 2 based on a straightforward extension of the proof of Theorem 2. However, we prefer the choice in (12) since we are not aware of any theoretical result in the literature that suggests one can benefit from higher orders of penalty functions. In addition, our experience in numerical experiments in fact indicates that higher-order choices may yield poor performance. As shown in the supplementary material [36, Section 3.1], a large value of $v$ would make the landscape of $\phi({\bf h}_{j})=\|{\bf h}_{j}\|_{p}^{v}-\|{\bf h}_{j}\|_{q}^{v}$ have a flat valley around the origin. Besides, as shown in [36, Section 3.2], large values of $v$ and $q$ make Algorithm 1 less stable and make it more difficult to reach a feasible solution satisfying $\phi({\bf h}_{j})=0$ .

Our choice of $p=1,q=\infty,v=1$ for the non-smooth penalty (13) is constructed on purpose in order to achieve the exact penalty property in Theorem 3. As one can see from (46), a key property in proving Theorem 3 is that the directional derivative of $\phi({\bf x}^{\rho})={\bf 1}^{\top}{\bf x}^{\rho}-\|{\bf x}^{\rho}\|_{\infty}$ has an upper bound linear in ${\bm{d}}$ and independent of the exact values of ${\bf x}^{\rho}$ . Such a property would not hold for $v>1$ or other choices of $q>1$ . In Section IV-B, we will further show that (13) with the negative infinity norm admits a simple proximal operator, which enables us to solve (13) efficiently.

IV Obtaining a Stationary Point of (11)

In accordance with Theorem 2 and Theorem 3, we need to obtain a stationary solution for problem (12) and a d-stationary solution for problem (13), respectively. In view of the separable constraint structure for ${\bf W}$ and ${\bf H}$ , the PALM algorithm in [29] is particularly efficient in handling the two problems.

IV-A Algorithm for Solving (12)

By applying the PALM algorithm to the smooth penalized problem (12), one simply performs block-wise gradient projection with respect to ${\bf W}$ and ${\bf H}$ iteratively, as shown in Algorithm 2. Here, $t^{k}$ and $c^{k}$ are two step size parameters. Denote $L_{G}({\bf W}^{k})$ and $L_{G}({\bf H}^{k})$ as the Lipschitz constants of $\nabla_{{\bf H}}G_{\rho}({\bf W}^{k},{\bf H})$ and $\nabla_{{\bf W}}G_{\rho}({\bf W},{\bf H}^{k})$ , respectively. Convergence of Algorithm 2 can be established based on [29].
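The block-wise gradient projection can be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation: Algorithm 2 is not reproduced in this section, so the sketch assumes that the only constraints are ${\bf W}\geq{\bm{0}}$ and ${\bf H}\geq{\bm{0}}$ , and it uses the spectral norm of each block Hessian as a conservative step size parameter (an assumption that satisfies $t^{k}>L_{G}({\bf W}^{k})/2$ and $c^{k}>L_{G}({\bf H}^{k})/2$ , but is not necessarily the setting used in the experiments). The gradients follow directly from the definition of $G_{\rho}$ in (12).

```python
import numpy as np


def objective(X, W, H, rho, mu_w, mu_h):
    """G_rho(W, H) = F(W, H) + (rho/2) * sum_j ((1^T h_j)^2 - ||h_j||_2^2)."""
    fit = np.linalg.norm(X - W @ H, "fro") ** 2
    reg = 0.5 * mu_w * np.sum(W**2) + 0.5 * mu_h * np.sum(H**2)
    pen = 0.5 * rho * np.sum(H.sum(axis=0) ** 2 - np.sum(H**2, axis=0))
    return fit + reg + pen


def grad_H(X, W, H, rho, mu_h):
    """Gradient of G_rho with respect to H."""
    K = H.shape[0]
    col_sums = np.ones((K, K)) @ H  # every row holds the column sums 1^T h_j
    return 2.0 * W.T @ (W @ H - X) + mu_h * H + rho * (col_sums - H)


def grad_W(X, W, H, mu_w):
    """Gradient of G_rho with respect to W (the penalty does not involve W)."""
    return 2.0 * (W @ H - X) @ H.T + mu_w * W


def palm_step(X, W, H, rho, mu_w=0.0, mu_h=0.0):
    """One pass: projected gradient step on H, then on W (sketch only)."""
    K = H.shape[0]
    # Block Hessian w.r.t. each column of H; its spectral norm bounds the
    # Lipschitz constant of the H-gradient (assumed step-size rule).
    hess_H = 2.0 * W.T @ W + mu_h * np.eye(K) + rho * (np.ones((K, K)) - np.eye(K))
    t = np.linalg.norm(hess_H, 2) + 1e-12
    H_new = np.maximum(H - grad_H(X, W, H, rho, mu_h) / t, 0.0)

    hess_W = 2.0 * H_new @ H_new.T + mu_w * np.eye(K)
    c = np.linalg.norm(hess_W, 2) + 1e-12
    W_new = np.maximum(W - grad_W(X, W, H_new, mu_w) / c, 0.0)
    return W_new, H_new
```

Under these assumptions each block update is a projected gradient step with step size no larger than the inverse Lipschitz constant of that block, so the objective value should not increase from one pass to the next.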

Theorem 4

Let $\{{\bf W}^{k},{\bf H}^{k}\}$ be the sequence generated by Algorithm 2 with $t^{k}>\frac{L_{G}({\bf W}^{k})}{2}$ and $c^{k}>\frac{L_{G}({\bf H}^{k})}{2}$ . Then $\{G_{\rho}({\bf W}^{k},{\bf H}^{k})\}$ is non-increasing, $\{{\bf W}^{k},{\bf H}^{k}\}$ is bounded, and $\{{\bf W}^{k},{\bf H}^{k}\}$ converges to a stationary point of (12).

Proof: Since $G_{\rho}({\bf W},{\bf H})$ is a coercive function, and the PALM updates in Algorithm 2 guarantee descent of the objective function $G_{\rho}({\bf W}^{k},{\bf H}^{k})$ [29, Remark 4(iii)] under $t^{k}>\frac{L_{G}({\bf W}^{k})}{2}$ and $c^{k}>\frac{L_{G}({\bf H}^{k})}{2}$ , the sequence $\{{\bf W}^{k},{\bf H}^{k}\}$ is bounded. Besides, because $G_{\rho}({\bf W},{\bf H})$ satisfies the Kurdyka-Łojasiewicz (KL) property, and ${\bf W}\geq{\bm{0}}$ , ${\bf H}\geq{\bm{0}}$ are convex sets, we can obtain the desired results by [29, Theorem 1]. $\blacksquare$

IV-B Algorithm for Solving (13)

The non-smooth penalized problem (13) is more challenging to handle due to the non-smooth and non-convex term $-\|{\bf h}_{j}\|_{\infty}$ . In particular, when applying the same PALM strategy to problem (13), the corresponding subproblem for updating ${\bf H}$ is given by

[TABLE]

where ${\bm{B}}^{k+1}={\bf H}^{k}-\frac{1}{t^{k}}\nabla_{{\bf H}}\widetilde{F}_{\rho}({\bf W}^{k},{\bf H}^{k})$ and $\widetilde{F}_{\rho}({\bf W},{\bf H})\triangleq F({\bf W},{\bf H})+\rho\sum_{j=1}^{N}{\bf 1}^{\top}{\bf h}_{j}$ is the smooth component in (13a). Problem (15) is a proximal operator associated with the non-smooth and concave function $-\rho\sum_{j=1}^{N}\|{\bf h}_{j}\|_{\infty}$ . Intriguingly, while being non-convex, the proximal operator (15) has a simple closed-form solution as stated below.

Proposition 2

Consider the following problem

[TABLE]

where ${\bm{y}}=[y_{1},\ldots,y_{n}]^{\top}\in\mathbb{R}^{n}$ is given and $c>0$ is a scalar. Denote ${\bf x}^{\star}=[x_{1}^{\star},\dots,x_{n}^{\star}]^{\top}$ as an optimal solution of (16), and let $i^{\star}$ be the unique index such that $\|{\bf x}^{\star}\|_{\infty}=({\bf x}^{\star})_{i^{\star}}$ . Then, $\displaystyle i^{\star}\in\arg\max_{i=1,\ldots,n}y_{i}$ and

[TABLE]

for $i=1,\ldots,n$ .

Proposition 2 is proved in Appendix E. By applying Proposition 2 to (15), we obtain the PALM method for solving problem (13) in Algorithm 3.
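The closed-form solution can be illustrated with a short Python sketch. Since (16) and (17) are not reproduced in this section, the sketch assumes that (16) has the form $\min_{{\bf x}\geq{\bm{0}}}\ \frac{1}{2}\|{\bf x}-{\bm{y}}\|_{2}^{2}-c\|{\bf x}\|_{\infty}$ ; under this assumption, the minimizer projects every entry onto the non-negative orthant and shifts only one largest entry of ${\bm{y}}$ by $c$ . The function names are hypothetical. For (15), this would be applied column by column with ${\bm{y}}$ set to a column of ${\bm{B}}^{k+1}$ and $c=\rho/t^{k}$ , again under the assumed scaling.

```python
def neg_inf_norm_objective(x, y, c):
    """Assumed form of (16): 0.5 * ||x - y||_2^2 - c * ||x||_inf."""
    squared_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 0.5 * squared_dist - c * max(abs(xi) for xi in x)


def prox_neg_inf_norm(y, c):
    """Minimizer of the assumed objective over x >= 0 (sketch only)."""
    # Plain projection onto the non-negative orthant for every entry.
    x = [max(yi, 0.0) for yi in y]
    # Pick an index of the largest entry of y (ties broken by first occurrence).
    i_star = max(range(len(y)), key=lambda i: y[i])
    # Only the selected entry is shifted by c before projection.
    x[i_star] = max(y[i_star] + c, 0.0)
    return x


print(prox_neg_inf_norm([1.0, 2.0, -1.0], 0.5))
```

The design choice worth noting is that the concave term $-c\|{\bf x}\|_{\infty}$ equals $\min_{i}(-c\,x_{i})$ for ${\bf x}\geq{\bm{0}}$ , so the problem splits into $n$ simple convex problems, one per candidate index, and the best candidate is the one with the largest $y_{i}$ .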

We show in the following theorem that Algorithm 3 can yield a d-stationary solution of the penalized problem (13), as desired by Theorem 3. Denote $L_{\widetilde{F}}({\bf W}^{k})$ and $L_{F}({\bf H}^{k})$ as the Lipschitz constants of $\nabla_{{\bf H}}\widetilde{F}_{\rho}({\bf W}^{k},{\bf H})$ and $\nabla_{{\bf W}}F_{\rho}({\bf W},{\bf H}^{k})$ , respectively.

Theorem 5

Let $\{{\bf W}^{k},{\bf H}^{k}\}$ be the sequence generated by Algorithm 3 with $t^{k}>L_{\widetilde{F}}({\bf W}^{k})$ and $c^{k}>\frac{L_{F}({\bf H}^{k})}{2}$ . Then $\{F_{\rho}({\bf W}^{k},{\bf H}^{k})\}$ is non-increasing, $\{{\bf W}^{k},{\bf H}^{k}\}$ is bounded, and $\{{\bf W}^{k},{\bf H}^{k}\}$ converges to a d-stationary point of (13).

The proof is presented in Appendix F. By combining Theorem 3 and Theorem 5, we see that the NSNCP method with Algorithm 1 and Algorithm 3 can yield a feasible B-stationary solution of problem (9).

We should emphasize again that Proposition 2 is critical to satisfy the condition of Theorem 3 since it enables Algorithm 3 to output a d-stationary solution of problem (13) as stated in Theorem 5. On the contrary, if one uses the subgradient method for problem (13), it can only achieve a critical point defined based on the subdifferential of the non-smooth terms [29, Theorem 1], which is weaker than the d-stationary point due to the difference-of-convex (DC) objective function of problem (13); see [37] for detailed discussions.

Before ending the section, we have the following remarks.

Remark 2

(Comparison between SNCP and NSNCP) As presented in Section III, the NSNCP method has a stronger theoretical result since it can guarantee exact penalty without necessarily achieving a local minimum solution of the penalized problem (11). However, numerically we found the two methods in fact perform comparably to each other in terms of data clustering; see Section V for comparison details of these two versions of the penalty approach. Moreover, both methods can often yield more favorable results than the classical K-means, and outperform the existing ONMF methods in terms of both clustering performance and computation time. This will be discussed in Section V.

Remark 3

(Complexity comparison with existing ONMF methods) As discussed in [20], the DTPP [17] and ONMF-S [19] have the same per-iteration complexity of the order $\mathcal{O}(MNK+K^{2}M)$ , and the HALS [20] has the per-iteration complexity of the order $\mathcal{O}(MNK+K^{2}(M+N))$ , whereas ONP-MF [21] has a much higher per-iteration complexity, which is no less than $\mathcal{O}(N^{3}+KN^{2}+MNK)$ due to the SVD of a $K\times N$ matrix. One can easily verify that the per-iteration complexity of both the proposed Algorithms 2 and 3 is $\mathcal{O}(MNK+K^{2}(M+N))$ , which is comparable to the complexity order of DTPP, ONMF-S and HALS since $K\ll\min\{M,N\}$ . In Section V, we will further demonstrate that the proposed SNCP/NSNCP methods have faster convergence speed than most of the existing ONMF methods, and therefore are more efficient in computation time.

V Experiment Results

In this section, we examine the clustering performance of the proposed SNCP and NSNCP methods against 6 existing clustering methods, namely, K-means (KM), K-means++ [11], DTPP [17], ONP-MF [21], ONMF-S [19] and HALS [20]. Note that the latter four methods are all based on the ONMF model (3). The adjusted Rand index (ARI) [38] and clustering accuracy (ACC) [32] are adopted for performance evaluation, both of which are widely used metrics for clustering validation. ACC is the ratio of the number of correctly clustered data samples to the total number of samples. A data sample is said to be correctly clustered if it is assigned to the same cluster as that in the ground truth. To define ARI, let $S=\{s_{1},s_{2},\ldots,s_{n}\}$ be a set of data samples, $C=\{c_{1},\ldots,c_{K}\}$ be a clustering result obtained by an algorithm, and $L=\{l_{1},\ldots,l_{K}\}$ be the clustering given by the ground truth. The ARI of the clustering $C$ is defined as

[TABLE]

where $n_{i,j}$ is the number of samples that are in both cluster $c_{i}$ of $C$ and $l_{j}$ of $L$ , $n_{i\cdot}=\sum_{j}n_{i,j}$ is the number of samples in cluster $c_{i}$ of $C$ , and $n_{\cdot j}=\sum_{i}n_{i,j}$ is the number of samples in cluster $l_{j}$ of $L$ . A larger value of ARI indicates a better clustering result. However, due to limited space, we only present the results of ACC here and relegate the results of ARI to [36, Section 4]. Both proposed methods are implemented in Python and have been uploaded to https://github.com/wshuai317/NCP_ONMF.
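As an illustration of the two metrics, the sketch below computes ARI from the contingency counts $n_{i,j}$ , $n_{i\cdot}$ and $n_{\cdot j}$ , and ACC by searching for the best one-to-one relabeling of the predicted clusters. Two assumptions are made: the ARI expression is the standard pair-counting form (the displayed equation is not reproduced in this section), and ACC is evaluated after matching cluster labels to ground-truth labels, which is common practice but is not spelled out above. The brute-force search over permutations is only practical for a small number of clusters; for larger $K$ an assignment solver such as the Hungarian algorithm would be the usual choice. The function names are illustrative.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(true, pred):
    """ARI in the standard pair-counting form (assumed to match the text)."""
    n = len(true)
    if n < 2:
        return 1.0
    sum_ij = sum(comb(v, 2) for v in Counter(zip(pred, true)).values())
    sum_i = sum(comb(v, 2) for v in Counter(pred).values())  # from n_{i.}
    sum_j = sum(comb(v, 2) for v in Counter(true).values())  # from n_{.j}
    expected = sum_i * sum_j / comb(n, 2)
    max_index = 0.5 * (sum_i + sum_j)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def clustering_accuracy(true, pred):
    """ACC under the best one-to-one relabeling (brute force, small K only)."""
    labels = sorted(set(true) | set(pred))
    best_hits = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        hits = sum(mapping[p] == t for t, p in zip(true, pred))
        best_hits = max(best_hits, hits)
    return best_hits / len(true)


truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]), clustering_accuracy(truth, [1, 1, 0, 0]))
```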

V-A Performance with Synthetic Data

We follow the linear model ${\bf X}={\bf W}{\bf H}+{\bf V}$ in [10] to generate the synthetic data, where ${\bf V}\in\mathbb{R}^{M\times N}$ denotes the measurement noise. The signal-to-noise ratio (SNR) is defined as $10\log_{10}(\|{\bf W}{\bf H}\|_{F}^{2}/\|{\bf V}\|_{F}^{2})$ dB. We follow the same procedure as in [10] to generate ${\bf W}$ , ${\bf V}$ and the cluster assignment matrix ${\bf H}$ , with $M=2000$ , $N=1000$ and $K=10$ (10 clusters). The numbers of data samples in the 10 clusters are $117,62,36,124,15,24,119,43,122$ and $338$ , respectively. Like [10], $5\%$ of the data samples are replaced by randomly generated outliers.
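The SNR definition above can be turned into a small helper that rescales a noise matrix to hit a target SNR. This sketch only illustrates the formula; it does not reproduce the data generation procedure of [10], and the function names and the toy dimensions are assumptions made for the example.

```python
import numpy as np


def snr_db(signal, noise):
    """SNR = 10 * log10(||signal||_F^2 / ||noise||_F^2) in dB."""
    return 10.0 * np.log10(np.sum(signal**2) / np.sum(noise**2))


def scale_noise_to_snr(signal, noise, target_db):
    """Rescale the noise so that snr_db(signal, scaled_noise) equals target_db."""
    target_ratio = 10.0 ** (target_db / 10.0)
    scale = np.sqrt(np.sum(signal**2) / (np.sum(noise**2) * target_ratio))
    return noise * scale


# Toy example with hypothetical sizes (not the M = 2000, N = 1000 setting).
rng = np.random.default_rng(0)
WH = rng.random((20, 30))
V = scale_noise_to_snr(WH, rng.standard_normal((20, 30)), target_db=-3.0)
X = WH + V
print(round(snr_db(WH, V), 6))
```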

Parameter setting: In Algorithm 1, the initial penalty parameter $\rho$ is set to $10^{-8}$ . For Step 4 of Algorithm 1, the orthogonality of ${\bf H}^{(r)}$ is measured by

[TABLE]

where ${\bf Q}^{(r)}$ is a diagonal matrix such that rows of ${\bf Q}^{(r)}{\bf H}^{(r)}$ have unit 2-norm. One updates $\rho=\gamma\rho$ in Step 4 of Algorithm 1 whenever $\epsilon_{\rm orth}\geq 10^{-10}$ . Besides, we define

[TABLE]

The stopping condition of Algorithm 1 is when both $\epsilon_{\rm orth}$ and $\epsilon_{\rm NR}$ are sufficiently small.

For Algorithm 2, $t^{k}$ and $c^{k+1}$ are chosen as $\frac{1}{2}\lambda_{\max}(\nabla_{{\bf H}}^{2}G_{\rho}({\bf W}^{k},{\bf H}^{k}))$ and $\frac{1}{2}\lambda_{\max}(\nabla_{{\bf W}}^{2}G_{\rho}({\bf W}^{k},{\bf H}^{k+1}))$ , respectively; while for Algorithm 3, $t^{k}$ and $c^{k+1}$ are chosen as $\lambda_{\max}(\nabla_{{\bf H}}^{2}\widetilde{F}_{\rho}({\bf W}^{k},{\bf H}^{k}))$ and $\frac{1}{2}\lambda_{\max}(\nabla_{{\bf W}}^{2}F_{\rho}({\bf W}^{k},{\bf H}^{k+1}))$ , respectively. The stopping condition for Algorithm 2 and Algorithm 3 is based on the normalized residual of $({\bf W}^{k},{\bf H}^{k})$ , which is defined in the same way as (19) and is denoted as $\epsilon_{\rm PALM}$ .

Unless mentioned otherwise, we choose $\mu_{w}=0$ and $\mu_{h}=10^{-10}$ in (9). The parameter $\gamma$ for increasing $\rho$ is set to 1.1. As for the stopping condition, unless mentioned otherwise, for the SNCP method, we set $\max\{\epsilon_{\rm orth},\epsilon_{\rm NR}\}\leq 10^{-5}$ for Algorithm 1 and $\epsilon_{\rm PALM}<3\times 10^{-3}$ for Algorithm 2; for the NSNCP method, we set $\max\{\epsilon_{\rm orth},\epsilon_{\rm NR}\}\leq 10^{-3}$ for Algorithm 1 and $\epsilon_{\rm PALM}<3\times 10^{-3}$ for Algorithm 3. For the four ONMF methods under comparison, the stopping condition is $\epsilon_{\rm NR}<10^{-5}$ or the maximum iteration number of 2000 being reached. All algorithms under test are initialized with 10 common, randomly generated initial points, and the presented results are averaged over the 10 experimental trials.
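The parameter settings above suggest the following schematic of the outer penalty loop. It is a sketch under stated assumptions rather than a transcription of Algorithm 1, which is not reproduced in this section: the inner solver and the two measures $\epsilon_{\rm orth}$ and $\epsilon_{\rm NR}$ are passed in as functions because their exact definitions in (18) and (19) are not shown here, and the argument names and the `max_outer` safeguard are hypothetical. The defaults mirror the SNCP setting ( $\rho=10^{-8}$ initially, $\gamma=1.1$ , tolerance $10^{-5}$ , and an update of $\rho$ whenever $\epsilon_{\rm orth}\geq 10^{-10}$ ).

```python
def ncp_outer_loop(solve_penalized, eps_orth, eps_nr, W, H,
                   rho=1e-8, gamma=1.1, tol=1e-5, orth_tol=1e-10,
                   max_outer=1000):
    """Schematic penalty loop: solve (11) for fixed rho, then increase rho.

    solve_penalized(W, H, rho) -> (W, H): inner solver, e.g. a PALM routine.
    eps_orth(H) -> float: orthogonality measure (definition assumed external).
    eps_nr(W_old, H_old, W_new, H_new) -> float: normalized residual
        (definition assumed external).
    """
    for _ in range(max_outer):
        W_new, H_new = solve_penalized(W, H, rho)
        orth = eps_orth(H_new)
        residual = eps_nr(W, H, W_new, H_new)
        W, H = W_new, H_new
        if max(orth, residual) <= tol:
            break  # both stopping measures are sufficiently small
        if orth >= orth_tol:
            rho *= gamma  # tighten the penalty while H is not yet orthogonal
    return W, H, rho
```

Warm-starting each penalized subproblem from the previous solution, as done here, is what lets the clustering assignment emerge gradually as $\rho$ grows.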

Effect of $\mu_{w}$ and $\mu_{h}$ as well as $\epsilon_{\rm PALM}$ and $\gamma$ : The parameters $\mu_{w}$ and $\mu_{h}$ as well as $\epsilon_{\rm PALM}$ and $\gamma$ do have an impact on the algorithm convergence. To justify our choices above, we conducted extensive experiments with different combinations of these parameters. The numerical results are presented in [36, Section 1] of the supplementary material. The messages that one can infer from these experiments are that 1) smaller values of $\mu_{w}$ and $\mu_{h}$ are preferred, and 2) a larger $\gamma$ can speed up the attainment of orthogonality of ${\bf H}$ , but requires a smaller value of $\epsilon_{\rm PALM}$ in order to achieve a lower objective value of (9). Interested readers may refer to the supplementary material [36] for more discussions.

SNCP vs NSNCP: Let us first examine the convergence behaviors of the proposed SNCP and NSNCP methods. Fig. 1(a) and Fig. 1(b) respectively display the normalized residual and orthogonality achieved versus the iteration number of Algorithm 1 when the SNCP and NSNCP methods are used. One can see from Fig. 1(a) that both SNCP and NSNCP converge and they can converge faster for smaller values of $\epsilon_{\rm PALM}$ . As seen in Fig. 1(b), both methods indeed can achieve an orthogonal ${\bf H}$ . Moreover, it can be observed that the NSNCP usually converges faster, and it takes about 130 iterations to reach $\epsilon_{\rm orth}<10^{-15}$ , which is faster than the SNCP method.

To further examine the difference between the SNCP and NSNCP methods, we plot in Fig. 2(a) the convergence curves of clustering accuracy (ACC) versus the iteration number of Algorithm 1 for both the SNCP and NSNCP methods, on the synthetic data with SNR = $-3$ dB. The results are averaged over 10 experiments, each of which uses a different initial point for Algorithm 1. The error bars (dashed line) are the standard deviation of the 10 experimental results. Intriguingly, one can observe that the NSNCP method actually converges faster than the SNCP method in terms of clustering accuracy, which echoes Theorem 3 in that the NSNCP method requires only a finite $\rho$ to achieve a feasible and meaningful solution of (9). However, as seen from the figure, the SNCP method eventually can reach a slightly higher clustering accuracy than the NSNCP method on the synthetic data.

In Fig. 2(b), we further present the clustering accuracy (ACC) versus the stopping condition $\epsilon_{\rm PALM}$ of these two methods. One can observe that the clustering accuracy improves with a smaller $\epsilon_{\rm PALM}$ for both methods. We also see that the performance gap between the two methods reduces with smaller $\epsilon_{\rm PALM}$ , and when $\epsilon_{\rm PALM}<10^{-3}$ , they yield comparable clustering performance. This implies that the NSNCP method may require a more stringent $\epsilon_{\rm PALM}$ in order to reach a desired clustering performance.

Comparison with the existing ONMF methods: We compare the convergence speed of the proposed SNCP/NSNCP methods with the ONMF methods. Fig. 3(a) and Fig. 3(b) respectively show the average normalized residual achieved versus the (total) iteration number. For the SNCP/NSNCP, the iteration number refers to the accumulated iteration number of Algorithm 2 (resp. Algorithm 3) when the outer loop Algorithm 1 is applied. It can be observed from Fig. 3 that these traditional ONMF methods suffer from slow convergence, whereas the SNCP/NSNCP can converge faster although their progress is slow at the beginning. Table II further shows the average iteration numbers of all methods under test to reach the normalized residual $\epsilon_{\rm NR}<10^{-5}$ . One can see that DTPP and ONMF-S both require more than 2000 iterations while the HALS requires fewer iterations on average. The SNCP/NSNCP both require fewer iterations than the other three ONMF methods, and the NSNCP requires the fewest.

Clustering performance: Table I lists the average clustering performance (ACC) of the eight methods under test on the synthetic data with different SNR values. All results are obtained by averaging over 20 experiments. In each experiment, all methods use the same randomly generated initial point.

First of all, one can observe that K-means++ does not perform better than the K-means for the synthetic dataset. The ONMF based methods (i.e., DTPP, ONP-MF, ONMF-S, HALS, and proposed SNCP/NSNCP) significantly outperform the K-means and K-means++. Nevertheless, one can see from Table I that the proposed SNCP and NSNCP consistently yield the best clustering performance.

Computational time: Table I also shows the CPU time taken by each method. The computer used in the experiments runs Ubuntu 16.04 and is equipped with a 3.40 GHz Intel Core i7-6700 CPU and 52 GB RAM. As seen, other than the K-means and K-means++, which are well known to be computationally cheap, the proposed SNCP and NSNCP methods are more efficient in computation time than the other four ONMF based methods. In particular, while the ONP-MF can provide competitive clustering performance when SNR $\geq 1$ dB, its computation time is long. By contrast, the HALS can provide reasonably good clustering performance when SNR $\geq 1$ dB and its computation time is moderate. Lastly, we note that the NSNCP method is slightly faster than the SNCP method, though the latter can provide the best clustering performance. All these results are consistent with the discussion in Remark 3 and the results in Table II.

Clustering stability: We evaluate the stability of the clustering methods against different initial points. In particular, we adopt the consensus map and cophenetic correlation (CC) coefficient [5] to measure the stability. The consensus map is based on the consensus matrix ${C}\in\mathbb{R}^{n\times n}$ , which has each entry $[C]_{i,j}=1$ if sample $i$ and sample $j$ are assigned to the same cluster, and entry $[C]_{i,j}=0$ otherwise. Then the consensus map, denoted by $\bar{C}$ , is the heatmap of the average of consensus matrices obtained by 10 runs of experiments with different initial points. Thus, roughly speaking, in the consensus map $\bar{C}$ , the $(i,j)$ th entry will be close to 1 if sample $i$ and sample $j$ are consistently assigned to the same cluster even under different initial conditions. The CC coefficient (between 0 and 1) is a quantitative measure of the consensus matrix and it is defined as the Pearson correlation of two distance matrices of data samples: one is given by using the off-diagonal entries of $\bar{C}$ and the other one utilizes the cophenetic distances of samples after performing average linkage hierarchical clustering; see [39] for details. The CC coefficient would approach 1 if the consensus map $\bar{C}$ has entries closer to $0$ or $1$ .
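The construction of the consensus matrix and its average over runs can be sketched in a few lines of plain Python. The function names are illustrative, and the sketch stops at the averaged matrix $\bar{C}$ ; the CC coefficient additionally requires a hierarchical clustering step that is not shown here.

```python
def consensus_matrix(labels):
    """C[i][j] = 1 if samples i and j share a cluster label, else 0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def consensus_map(runs):
    """Entry-wise average of the consensus matrices over several runs."""
    n = len(runs[0])
    matrices = [consensus_matrix(labels) for labels in runs]
    return [[sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
            for i in range(n)]


# Two runs on three samples: samples 0 and 1 are co-clustered in one run only,
# so the corresponding entry of the averaged matrix is 0.5.
c_bar = consensus_map([[0, 0, 1], [0, 1, 1]])
print(c_bar[0][1], c_bar[0][2])
```

Because only co-membership is recorded, the matrix is invariant to how each run happens to number its clusters, which is what makes it suitable for comparing runs started from different initial points.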

Fig. 4 shows the results for SNR = $-3$ dB, and one can see that the proposed SNCP method gives the most stable clustering results, followed by the NSNCP method. In particular, clustering inconsistency only happens in small-sized clusters for these two methods.

V-B Application to Biological Data Analysis

In the experiment, we apply the various clustering methods to The Cancer Genome Atlas (TCGA) database [30, 31], which contains the expression data of 20531 genes on 11135 cancer samples belonging to 33 cancer types. We consider 5 subsets of the TCGA dataset with different numbers of cancer types as shown in the 3rd row of Table III. The Pearson’s Chi-Squared Test [40] is applied to select 5000 genes ( $M=5000$ ) for each data sample.

For the proposed SNCP method and NSNCP method, an additional constraint is added to upper bound each entry of ${\bf W}$ by the maximum value of the data. The other parameters are set the same as that in Section V-A.

Performance comparison: One can see from Table III that the proposed NSNCP method performs best on the TCGA data, especially for the first three datasets. The SNCP method provides performance very close to that of the NSNCP method on datasets 2, 3 and 4. Compared with the ONMF based methods, the K-means and K-means++ yield relatively poor performance on this TCGA data. In terms of computation time, the HALS is still the most time efficient one, especially when the sample size is small. However, there exists a considerable performance gap between HALS and the proposed NCP methods, and the computational advantage of HALS can diminish when the data size increases. Lastly, one can observe from the table that the other three ONMF methods are quite computationally expensive, particularly the ONP-MF and ONMF-S.

V-C Application to Document Clustering

We here examine the performance of the proposed methods on the document dataset TDT2 corpus [32] which consists of 10212 on-topic documents in total with 56 semantic categories. We extract 6 subsets, each of which contains 10 randomly picked categories ( $K=10$ ). Each document sample is normalized and represented as a term-frequency-inverse-document-frequency (tf-idf) vector [32]. The dimension of each test data is shown in the first three rows of Table IV. In the experiment, we set $\mu_{w}=0$ and $\mu_{h}=10^{-8}$ for the proposed SNCP and NSNCP methods.

Performance comparison: As seen from Table IV, except for Datasets 5 and 6, the ONMF based methods yield better performance than the K-means/K-means++. In particular, K-means++ yields the highest clustering accuracy on Dataset 5, and K-means gives comparable performance on Dataset 6. A close inspection shows that for Dataset 5, the initial points picked by K-means++ happen to be close to the cluster centroids of the ground truth.

In addition, we can see that, except for Datasets 3 and 5, the SNCP method provides higher accuracy than the ONMF-S/ONP-MF. Besides, as seen from Table IV, the computation time of the ONMF-S/ONP-MF is large, and they can be 50 times slower than the SNCP method on Dataset 6. It is also seen that the HALS is the most time efficient in this experiment although it only provides moderate clustering performance. While the NSNCP method does not perform as well as its smooth counterpart, it gives comparable clustering performance for the first four datasets, and is computationally time efficient compared to ONMF-S and ONP-MF.

Clustering on dimension-reduced data:

Since the feature size of the TDT2 data is large and in view of the fact that dimension reduction techniques can extract low-rank structure of data and improve the clustering performance, we apply the spectral clustering (SC) [12] to the TDT2 data followed by applying the various clustering methods to the dimension-reduced data. In the experiment, we set $\mu_{w}=\mu_{h}=0$ and each entry of ${\bf W}$ is lower (resp. upper) bounded by the minimum (resp. maximum) value of the dimension-reduced data. The experimental results are displayed in Table V. As observed, all methods have a great leap in the clustering performance when compared to Table IV, and there is no significant performance gap between various methods. Nevertheless, the proposed SNCP method can provide competitive clustering results and performs best on Datasets 1, 2, and 6. Although the ONP-MF performs best on Datasets 3 and 5, it remains expensive in computation time. While the K-means++ gives the best performance on Dataset 4, one can see that the proposed SNCP method is only slightly worse.

To look into the reason why SC can greatly improve the clustering performance, we employ t-SNE [41] (footnote 3: although the performance of t-SNE may vary with different initial points and hyperparameters, we have tested different initial points in the experiments and found most of the results to be consistent with each other) and visualize some of the TDT2 datasets in Fig. 5. Here, TDT2_2 denotes Dataset 2 in Table IV while TDT2_DR2 denotes dimension-reduced Dataset 2 by SC. We can see from Fig. 5 that, although all of Datasets 2, 4 and 6 exhibit much clearer cluster structures after dimension reduction by SC, data samples in both TDT2_DR2 and TDT2_DR6 still overlap with each other, which is challenging for K-means and K-means++. This may explain to some extent why all methods in Table V have significantly improved clustering performance compared to Table IV, and why the K-means++ does not perform as well as the SNCP on TDT2_DR2 and TDT2_DR6.

In particular, we see that data samples in TDT2_DR4 in Fig. 5(e) become almost separable, and therefore the K-means++ can achieve a nearly perfect performance and outperforms the SNCP method. By contrast, as seen in Fig. 5(d) and 5(f), data samples in both TDT2_DR2 and TDT2_DR6 still overlap with each other after dimension reduction, and their data samples are unevenly spread with complex shapes. Such a case is known to be challenging for K-means and K-means++ [16]. As seen from Table V, the SNCP method, which outperforms the K-means++ on the two datasets, is more capable of capturing complicated cluster structures.

VI Conclusion

In this paper, we have proposed the ONMF based data clustering formulation (9) and the NCP approach (11) and Algorithm 1. We have considered one smooth NCP formulation (12) and one non-smooth NCP formulation (13), and analyzed the theoretical conditions of the two methods for which a feasible and meaningful solution of (9) can be obtained. Efficient implementations of the proposed methods based on the PALM algorithm (Algorithm 2 and Algorithm 3) have also been devised.

Extensive experiments have been conducted on the synthetic data and the real datasets TCGA and TDT2. In particular, compared with the existing K-means and ONMF based methods, the proposed methods can perform either significantly better or comparably in terms of clustering accuracy, while being much more time efficient than most of the other ONMF based methods.

It is worthwhile to mention some interesting directions for future research. Firstly, while the current PALM algorithms seem to work well, it is possible to employ some more advanced optimization algorithms such as Nesterov’s accelerated method [42, 43, 44] to improve algorithm convergence speed. Secondly, while parallel and distributed implementations of the proposed methods are feasible, it is important to reduce the communication overhead between cluster nodes for large-scale scenarios [45, 46]. It is also interesting to study joint dimension reduction and clustering methods [10] using the proposed NCP methods, and extend the current Euclidean distance measure to more general cost functions such as the $\beta$ -divergence [47, 48] or the cohesion measure in [49].

Appendix A Proof of Proposition 1

Proof: Since $(\overline{\bf W},\overline{\bf H})$ is a B-stationary point of problem (7) in the manuscript, we have

[TABLE]

Suppose that there exists $j^{\prime}\in{\mathcal{N}}$ such that $\overline{\bf h}_{j^{\prime}}={\bm{0}}$ . Then, (20) becomes

[TABLE]

Suppose that $((\overline{\bf W})^{T}{\bf x}_{j^{\prime}})_{\ell}>0$ for some $\ell\in{\mathcal{K}}$ . Choose a direction $\overline{\bm{D}}$ such that $\overline{\bm{d}}_{j}={\bm{0}},\forall j\neq j^{\prime}$ , and $\overline{\bm{d}}_{j^{\prime}}=\alpha{\bm{e}}_{\ell}$ for some $\alpha>0$ . It is easy to see that $\overline{\bm{D}}\in{\bm{T}}_{\mathcal{H}^{v}_{p,q}}(\overline{\bf H})$ based on Definition 1. By substituting $\overline{\bm{D}}$ into (A), we obtain

[TABLE]

which, however, is a contradiction. $\blacksquare$

Appendix B Proof of Theorem 1

As $({\bf W}^{\star},{\bf H}^{\star})$ is a local minimizer of problem (12), there exists a neighborhood of ${\bf H}^{\star}$ , denoted by ${\mathcal{N}}_{\epsilon}({\bf H}^{\star})=\{{\bf H}\geq{\bm{0}}~{}|~{}\|{\bf H}-{\bf H}^{\star}\|_{F}^{2}\leq\epsilon\}$ , such that

[TABLE]

Suppose there exists a $j^{\prime}$ such that ${\bf 1}^{\top}{\bf h}_{j^{\prime}}^{\star}\neq\|{\bf h}_{j^{\prime}}^{\star}\|_{2}$ . Let $c\triangleq{\bf 1}^{\top}{\bf h}_{j^{\prime}}^{\star}$ ,

[TABLE]

and define a hyperplane $\mathcal{H}_{c}\triangleq\{{\bf H}\geq{\bm{0}}~{}|~{}{\bf 1}^{\top}{\bf h}_{j^{\prime}}=c\}$ . We will derive a contradiction by showing that there exists a neighbor of ${\bf H}^{\star}$ that lies in the hyperplane $\mathcal{H}_{c}$ and achieves a smaller objective value than ${\bf H}^{\star}$ .

To this end, for $\ell=1,\ldots,|\Phi|$ , let

[TABLE]

which is obtained by replacing the $j^{\prime}$ th column of ${\bf H}^{\star}$ by

[TABLE]

where $0<\alpha_{\ell}<1$ and ${\bm{s}}^{({\ell})}\triangleq c{\bm{e}}_{i_{\ell}}.$ One can see that ${\bf H}^{(\ell)}\in\mathcal{H}_{c}$ since

[TABLE]

Firstly, we show that ${\bf H}^{\star}$ is a convex combination of ${\bf H}^{({\ell})},\forall\ell=1,\ldots,|\Phi|$ . By (25), we have ${\bm{s}}^{({\ell})}=\frac{{\bf h}_{j^{\prime}}^{({\ell})}-(1-\alpha_{\ell}){\bf h}_{j^{\prime}}^{\star}}{\alpha_{\ell}}$ . Thus, we can obtain

[TABLE]

where $\beta_{\ell}\triangleq\frac{[{\bf H}^{\star}]_{{i_{\ell}},j^{\prime}}}{c}$ . Rearranging terms in (27) gives rise to

[TABLE]

Notice that $\sum_{{\ell}=1}^{|\Phi|}\beta_{\ell}=\sum_{{\ell}=1}^{|\Phi|}\frac{[{\bf H}^{\star}]_{{i_{\ell}},j^{\prime}}}{c}=\frac{{\bf 1}^{\top}{\bf h}_{j^{\prime}}^{\star}}{c}=1$ . So (28) reduces to

[TABLE]

which implies that ${\bf H}^{\star}$ is a convex combination of ${\bf H}^{(\ell)},{\ell}=1,\ldots,|\Phi|$ .

Secondly, we show that $G_{\rho}({\bf W}^{\star},{\bf H})$ is strongly concave with respect to ${\bf H}$ on the hyperplane $\mathcal{H}_{c}$ as long as $\rho$ is sufficiently large. It is sufficient to show that $G_{\rho}({\bf W}^{\star},{\bf H})-\frac{\rho}{2}\sum_{j=1}^{N}{\bf 1}^{\top}{\bf h}_{j}=F({\bf W}^{\star},{\bf H})-\frac{\rho}{2}\sum_{j=1}^{N}\|{\bf h}_{j}\|_{2}^{2}$ is strongly concave in ${\bf H}$ . This is true since $\nabla_{{\bf H}}F({\bf W}^{\star},{\bf H})$ is Lipschitz continuous with a bounded Lipschitz constant (denoted by $\rho^{\star}$ ), and thus $F({\bf W}^{\star},{\bf H})-\frac{\rho}{2}\sum_{j=1}^{N}\|{\bf h}_{j}\|_{2}^{2}$ will be a strongly concave function as long as $\rho>\rho^{\star}$ .
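The concavity claim can be checked numerically on a toy instance. The sketch below is not taken from the paper: it assumes, purely for illustration, that the fit term restricted to one column is $\frac{1}{2}\|{\bf x}-{\bf W}{\bf h}\|_{2}^{2}$ (the paper's $F$ may contain further terms), so that the Lipschitz constant of its gradient in ${\bf h}$ is $\lambda_{\max}({\bf W}^{\top}{\bf W})$. The function names `concavity_threshold` and `penalized_hessian` are hypothetical.

```python
import numpy as np

def concavity_threshold(W):
    """Largest eigenvalue of W^T W: the Lipschitz constant of the gradient
    of the ASSUMED fit term 0.5 * ||x - W h||_2^2 with respect to h."""
    return float(np.linalg.eigvalsh(W.T @ W).max())

def penalized_hessian(W, rho):
    """Hessian in h of 0.5 * ||x - W h||_2^2 - (rho / 2) * ||h||_2^2."""
    K = W.shape[1]
    return W.T @ W - rho * np.eye(K)

rng = np.random.default_rng(0)
W = np.abs(rng.standard_normal((20, 4)))  # illustrative nonnegative factor
rho_star = concavity_threshold(W)

# Above the threshold every eigenvalue is negative (strongly concave);
# below it at least one eigenvalue stays positive.
eig_above = np.linalg.eigvalsh(penalized_hessian(W, 1.1 * rho_star))
eig_below = np.linalg.eigvalsh(penalized_hessian(W, 0.5 * rho_star))
print("rho* =", rho_star)
print("max eigenvalue at rho = 1.1 rho*:", eig_above.max())
print("max eigenvalue at rho = 0.5 rho*:", eig_below.max())
```

Under this assumption the Hessian is ${\bf W}^{\top}{\bf W}-\rho{\bf I}$, which is negative definite exactly when $\rho>\lambda_{\max}({\bf W}^{\top}{\bf W})$, matching the role of $\rho^{\star}$ in the argument.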

By the above two facts, we obtain that

[TABLE]

where ${\ell}^{\prime}=\arg\min_{{\ell}=1,\ldots,|\Phi|}G_{\rho}({\bf W}^{\star},{\bf H}^{({\ell})}).$ Since by (25), ${\bf H}^{({\ell^{\prime}})}$ will lie in ${\mathcal{N}}_{\epsilon}({\bf H}^{\star})$ for $\alpha_{\ell^{\prime}}<\frac{\epsilon}{c-[{\bf H}^{\star}]_{i_{\ell^{\prime}},j^{\prime}}}$ , (30) shows a contradiction to the local optimality of ${\bf H}^{\star}$ . Thus, for $\rho>\rho^{\star}$ , ${\bf H}^{\star}$ must be feasible and a local minimizer to (9). $\blacksquare$

Appendix C Proof of Theorem 2

To show that $({\bf W}^{\infty},{\bf H}^{\infty})$ is a B-stationary point to (9), it suffices to show

[TABLE]

where $\mathcal{H}^{2}_{1,2}=\{{\bf H}\geq{\bm{0}}~{}|~{}({\bf 1}^{\top}{\bf h}_{j})^{2}-\|{\bf h}_{j}\|_{2}^{2}=0,~{}\forall j\in{\mathcal{N}}\}$ . Since $({\bf W}^{\rho},{\bf H}^{\rho})$ is a stationary point of (12), we have

[TABLE]

which directly implies (31a) by taking $\rho\to\infty$ . To prove (31b), we employ the following proposition and apply it to (9) for each column of ${\bf H}$ .

Proposition 3

Consider the following problem

[TABLE]

where $f$ is smooth, and the corresponding penalized problem

[TABLE]

where $\rho>0$ is a penalty parameter. Let ${\bf x}^{\rho}$ be a stationary point of (34). Assume that ${\bf x}^{\rho}$ is bounded and ${\bf x}^{\rho}\to{\bf x}^{\infty}\neq{\bm{0}}$ as $\rho\to\infty$ . Then, ${\bf x}^{\infty}$ is a feasible B-stationary point of (33).

Proof: Since ${\bf x}^{\rho}$ is a stationary point of (34), it is also a Karush-Kuhn-Tucker (KKT) solution of (34). By complementary slackness, ${\bf x}^{\rho}$ satisfies

[TABLE]

Let ${\bf x}^{\infty}\neq 0$ be the limit of a subsequence $\{{\bf x}^{\rho_{k}}\}$ with $\{\rho_{k}\}\to\infty$ . We then have

[TABLE]

Since $\nabla f({\bf x}^{\rho_{k}})^{\top}{\bf x}^{\rho_{k}}$ is bounded due to bounded ${\bf x}^{\rho}$ , it follows from (36) that $({\bf 1}^{\top}{\bf x}^{\rho_{k}})^{2}-\|{\bf x}^{\rho_{k}}\|_{2}^{2}\to 0$ as $\rho_{k}\to\infty$ . Hence, ${\bf x}^{\infty}$ is feasible to (33), and all components of ${\bf x}^{\,\infty}$ are zero except one component, say ${\bf x}^{\,\infty}_{\bar{i}}>0$ .

We next show that

[TABLE]

i.e., ${\bf x}^{\infty}$ is a B-stationary point of (33). Before that, we claim that the tangent cone ${\mathcal{T}}_{{\mathcal{X}}}({\bf x}^{\,\infty})=\left\{{\bm{d}}\mid d_{i}=0,\forall i\neq\bar{i}\right\}$ . Indeed, any ${\bm{d}}\in{\mathcal{T}}_{{\mathcal{X}}}({\bf x}^{\,\infty})$ must satisfy $d_{i}\geq 0$ for all $i\neq\bar{i}$ and ${\bf 1}^{\top}{\bm{d}}=\frac{{\bm{d}}^{\top}{\bf x}^{\infty}}{\|{\bf x}^{\infty}\|_{2}}=d_{\bar{i}}$ . This yields $d_{i}=0$ for all $i\neq\bar{i}$ ; thus the tangent cone is contained in the right-hand cone. The reverse inclusion is obvious because ${\bf x}^{\infty}+\tau{\bm{d}}\in{\mathcal{X}}$ for all sufficiently small $\tau>0$ . Thus, (37) boils down to

[TABLE]

We next prove that there cannot exist an index $\bar{j}\neq\bar{i}$ and a subsequence $\{{\bf x}^{k}\triangleq{\bf x}^{\rho_{k}}\}_{k\in\kappa}$ such that $x^{k}_{\bar{j}}>0$ for all $k\in\kappa$ . If not, we may assume without loss of generality that this subsequence is the entire sequence $\{{\bf x}^{k}\}$ . Thus $x^{k}_{\bar{j}}>0$ for all $k$ . This is in addition to $x^{k}_{\bar{i}}>0$ for all sufficiently large $k$ . By the complementary slackness, we have

[TABLE]

Since $\frac{\partial f({\bf x}^{\,k})}{\partial x_{\bar{j}}}$ is bounded, and $\rho_{k}\to\infty$ and $x_{\bar{i}}^{k}\to x_{\bar{i}}^{\infty}>0$ as $k\to\infty$ , we arrive at a contradiction after taking the limit on both sides of inequality (40). Hence, for any converging subsequence $\{{\bf x}^{\rho_{k}}\}$ , only finitely many terms have a nonzero component other than the $\bar{i}$ th one.

By restricting to a further subsequence of $\{{\bf x}^{\rho_{k}}\}$ satisfying $x^{\rho_{k}}_{\bar{i}}>0$ and $x^{\rho_{k}}_{i}=0$ for all $i\neq\bar{i}$ , we have

[TABLE]

which implies (38), and the proof is complete. $\blacksquare$
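The feasibility step in the proof above rests on an elementary fact: for ${\bf x}\geq{\bm{0}}$, the quantity $({\bf 1}^{\top}{\bf x})^{2}-\|{\bf x}\|_{2}^{2}$ equals the sum of the cross products $x_{i}x_{j}$ over $i\neq j$, so it vanishes exactly when at most one entry is nonzero. The following is a minimal numerical sketch of that identity; the function names are illustrative and not from the paper.

```python
import numpy as np

def smooth_penalty(x):
    """(1^T x)^2 - ||x||_2^2, the per-column constraint function of H^2_{1,2}."""
    x = np.asarray(x, dtype=float)
    return float(x.sum() ** 2 - np.dot(x, x))

def cross_terms(x):
    """Sum of x_i * x_j over all ordered pairs with i != j."""
    x = np.asarray(x, dtype=float)
    outer = np.outer(x, x)
    return float(outer.sum() - np.trace(outer))

single_nonzero = [0.0, 2.5, 0.0]   # feasible: one positive entry
two_nonzero = [1.0, 2.0, 0.0]      # infeasible: two positive entries

print(smooth_penalty(single_nonzero))  # zero for a feasible column
print(smooth_penalty(two_nonzero))     # positive, equals 2 * 1.0 * 2.0
print(cross_terms(two_nonzero))        # same value via the cross-term form
```

This is why driving the penalty to zero as $\rho_{k}\to\infty$ forces the limit point to have a single positive component.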

Appendix D Proof of Theorem 3

Analogous to the proof of Theorem 2, it is sufficient to consider the following proposition.

Proposition 4

Consider the following problem

[TABLE]

where $f$ is smooth, and the corresponding penalized problem

[TABLE]

where $\rho>0$ is a penalty parameter. Let ${\bf x}^{\rho}$ be a d-stationary point of (43). Assume that ${\bf x}^{\rho}$ is bounded. Then there exists a finite $\rho^{\star}$ such that for all $\rho>\rho^{\star}$ , ${\bf x}^{\rho}$ is a feasible B-stationary point to problem (42).

Proof: Denote $\phi({\bf x})\triangleq{\bf 1}^{\top}{\bf x}-\|{\bf x}\|_{\infty}$ . Since ${\bf x}^{\rho}$ is a d-stationary point of (43), we have

[TABLE]

where ${\mathcal{T}}_{+}({\bf x}^{\rho})=\{{\bm{d}}~{}|~{}d_{i}\geq 0~{}\text{if}~{}x_{i}=0\}$ . Because

[TABLE]

and ${\bm{e}}_{i^{\star}}\in\partial\|{\bf x}^{\rho}\|_{\infty}$ , where $\displaystyle i^{\star}\in\arg\max_{i}x_{i}^{\rho}$ , condition (44) implies

[TABLE]

On the other hand, for bounded ${\bf x}^{\rho}$ , we can upper bound

[TABLE]

By substituting (47) into (46), we obtain

[TABLE]

We argue that $\phi({\bf x}^{\rho})=0$ for $\rho>\rho^{\star}$ . Suppose not. Let us choose a tangent $\bar{\bm{d}}\in{\mathcal{T}}_{+}({\bf x}^{\rho})$ as follows

[TABLE]

Substituting (50) into (49) gives rise to

[TABLE]

which is a contradiction since $\rho>\rho^{\star}$ . Thus $\phi({\bf x}^{\rho})=0$ must hold, i.e., ${\bf x}^{\rho}$ is feasible to (42) if $\rho>\rho^{\star}$ .

Next, we show that ${\bf x}^{\rho}$ is a B-stationary point to (42). Since ${\bf x}^{\rho}$ is feasible to ${\mathcal{X}}$ in (42) for $\rho>\rho^{\star}$ , the tangent cone of ${\mathcal{X}}$ at ${\bf x}^{\rho}$ is given by

[TABLE]

Besides, one can verify that for all ${\bm{d}}\in{\mathcal{T}}_{{\mathcal{X}}}({\bf x}^{\rho})$ ,

[TABLE]

Therefore, by applying (56) and (57) to (44), we obtain

[TABLE]

which is the desired result. $\blacksquare$
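For intuition about the function $\phi({\bf x})={\bf 1}^{\top}{\bf x}-\|{\bf x}\|_{\infty}$ used in the proof above: on ${\bf x}\geq{\bm{0}}$ it is the sum of all entries except the largest one, so it is nonnegative and vanishes exactly when at most one entry is nonzero. The sketch below checks this on small vectors; the helper names are illustrative, not from the paper.

```python
import numpy as np

def nonsmooth_penalty(x):
    """phi(x) = 1^T x - ||x||_inf, as defined in the proof of Proposition 4."""
    x = np.asarray(x, dtype=float)
    return float(x.sum() - np.abs(x).max())

def sum_without_largest(x):
    """Sum of all entries of a nonnegative vector except its largest one."""
    x = np.sort(np.asarray(x, dtype=float))
    return float(x[:-1].sum())

single_nonzero = [0.0, 2.5, 0.0]   # feasible for phi(x) = 0
two_nonzero = [1.0, 2.0, 0.0]      # infeasible

print(nonsmooth_penalty(single_nonzero))
print(nonsmooth_penalty(two_nonzero), sum_without_largest(two_nonzero))
```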

Appendix E Proof of Proposition 2

Denote $f({\bf x})=\frac{1}{2}\|{\bf x}-{\bm{y}}\|_{2}^{2}-c\|{\bf x}\|_{\infty}$ . Since ${\bf x}^{\star}$ is an optimal solution of (16), we have

[TABLE]

where $f^{\prime}({\bf x};{\bm{d}})$ is the directional derivative of $f({\bf x})$ at direction ${\bm{d}}$ , and

[TABLE]

As the directional derivative of $\|{\bf x}\|_{\infty}$ is $\max_{{\bm{g}}\in\partial(\|{\bf x}\|_{\infty})}{\bm{g}}^{T}{\bm{d}}$ , we have from (59) that

[TABLE]

Let $i^{\star}$ be an index such that $({\bf x}^{\star})_{i^{\star}}=\|{\bf x}^{\star}\|_{\infty}$ . Then, ${\bm{e}}_{i^{\star}}\in\partial(\|{\bf x}^{\star}\|_{\infty})$ , and (61) implies

[TABLE]

Now let us consider the case that $({\bf x}^{\star})_{i^{\star}}=0$ . Then, by (60), it holds that

[TABLE]

Substituting (63) into (62) gives rise to

[TABLE]

which implies $y_{i^{\star}}+c\leq 0.$ On the other hand, suppose $({\bf x}^{\star})_{i^{\star}}\neq 0$ . Then,

[TABLE]

Substituting (65) into (62) gives rise to

[TABLE]

which implies $({\bf x}^{\star})_{i^{\star}}=y_{i^{\star}}+c.$ By combining the above two implications, we obtain $({\bf x}^{\star})_{i^{\star}}=(y_{i^{\star}}+c)^{+}$ . In addition, by following a similar argument as in (63) to (66), one can obtain $({\bf x}^{\star})_{i}=(y_{i})^{+}$ for all $i\neq i^{\star}$ . Therefore, (17) is true and is recapitulated here

[TABLE]

for $i=1,\ldots,n$ . If one substitutes the optimal ${\bf x}^{\star}$ in (67) into the objective of (16), one can easily verify that the index $i^{\star}$ must be unique.

Next, let us show that

[TABLE]

We respectively consider three cases.

Case (a): ${\bm{y}}\geq{\bm{0}}$ . By (67), the optimal objective value of (16) is given by

[TABLE]

Obviously, we must have $i^{\star}\in\arg\max_{i=1,\ldots,n}y_{i}$ .

Case (b): ${\bm{y}}<{\bm{0}}$ . If $y_{i}+c\leq 0,\forall i=1,\ldots,n$ , we have the trivial solution ${\bf x}^{\star}={\bm{0}}$ according to (67) and the definition $({\bf x}^{\star})_{i^{\star}}=\|{\bf x}^{\star}\|_{\infty}$ .

Suppose that ${\bm{y}}<{\bm{0}}$ and the set ${\mathcal{I}}=\{i\in\{1,\ldots,n\}|y_{i}+c>0\}$ is non-empty. We claim that $i^{\star}\in\arg\max_{i\in{\mathcal{I}}}y_{i}$ . Suppose $i^{\star}\notin{\mathcal{I}}$ . Then we have ${\bf x}^{\star}={\bm{0}}$ and it leads to

[TABLE]

Suppose $i^{\star}\in{\mathcal{I}}$ . Then by (67), we have

[TABLE]

From (70) and (E), it is apparent that $f({\bm{0}})>f({\bf x}^{\star})$ , and thereby $i^{\star}\in{\mathcal{I}}$ .

Further consider two possible solutions, namely, ${\bf x}_{1}^{\star}$ and ${\bf x}_{2}^{\star}$ with $({\bf x}^{\star}_{1})_{i_{1}}=\|{\bf x}^{\star}_{1}\|_{\infty}$ and $({\bf x}^{\star}_{2})_{i_{2}}=\|{\bf x}^{\star}_{2}\|_{\infty}$ , respectively, where $i_{1},i_{2}\in{\mathcal{I}}$ . Then, by (E), we have

[TABLE]

whenever $y_{i_{2}}<y_{i_{1}}$ since $\frac{1}{2}(y_{i_{2}}+y_{i_{1}})+c>0$ for $i_{1},i_{2}\in{\mathcal{I}}$ . In other words, $i^{\star}$ with larger $y_{i^{\star}}$ leads to a smaller objective value, and thus we conclude that $i^{\star}\in\arg\max_{i\in{\mathcal{I}}}y_{i}$ .

Case (c): ${\bm{y}}\ngeq{\bm{0}}$ and ${\bm{y}}\nless{\bm{0}}$ . There are three possible situations: 1) $i^{\star}\in{\mathcal{I}}_{0}=\{i|y_{i}<0,y_{i}+c\leq 0\}$ ; 2) $i^{\star}\in{\mathcal{I}}_{1}=\{i|y_{i}\geq 0\}$ ; and 3) $i^{\star}\in{\mathcal{I}}_{2}=\{i|y_{i}<0,y_{i}+c>0\}$ .

Suppose $i^{\star}\in{\mathcal{I}}_{0}$ . Then ${\bf x}^{\star}={\bm{0}}$ , and it leads to (70).
Suppose $i^{\star}\in{\mathcal{I}}_{1}$ . Then, by (67), we have

[TABLE]

Clearly, $i^{\star}\in\arg\max_{i\in{\mathcal{I}}_{1}}y_{i}$ .

Suppose $i^{\star}\in{\mathcal{I}}_{2}$ . Then, by (67), we have

[TABLE]

Based on a similar argument as in (E), one can conclude $i^{\star}\in\arg\max_{i\in{\mathcal{I}}_{2}}y_{i}$ .

From (73) and (74), it is obvious that $f({\bm{0}})>f({\bf x}^{\star})$ for $i^{\star}\in{\mathcal{I}}_{1}\cup{\mathcal{I}}_{2}$ . Besides, consider two possible solutions, namely, ${\bf x}_{1}^{\star}$ with $({\bf x}^{\star}_{1})_{i_{1}}=\|{\bf x}^{\star}_{1}\|_{\infty}$ where $i_{1}\in{\mathcal{I}}_{1}$ , and ${\bf x}_{2}^{\star}$ with $({\bf x}^{\star}_{2})_{i_{2}}=\|{\bf x}^{\star}_{2}\|_{\infty}$ where $i_{2}\in{\mathcal{I}}_{2}$ . Then, by (73) and (74), we have

[TABLE]

where the last inequality holds since $y_{i_{2}}<0,y_{i_{1}}>0$ and $\frac{1}{2}y_{i_{2}}+c>y_{i_{2}}+c>0$ . Thus, we obtain $i^{\star}=i_{1}\in\arg\max_{i\in{\mathcal{I}}_{1}}y_{i}$ .

By summarizing the results from Case (a) to (c), we conclude that (68) holds true. Proposition 2 is proved. $\blacksquare$
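The closed form established above can be sketched in a few lines. The code assumes that problem (16) is $\min_{{\bf x}\geq{\bm{0}}}\frac{1}{2}\|{\bf x}-{\bm{y}}\|_{2}^{2}-c\|{\bf x}\|_{\infty}$ with $c>0$, which is how the proof treats it, and that (68) selects $i^{\star}\in\arg\max_{i}y_{i}$, as the three cases conclude. The function names are hypothetical, and the brute-force routine is only a sanity check that tries every index as $i^{\star}$.

```python
import numpy as np

def objective(x, y, c):
    """f(x) = 0.5 * ||x - y||_2^2 - c * ||x||_inf."""
    return 0.5 * float(np.sum((x - y) ** 2)) - c * float(np.max(np.abs(x)))

def prox_closed_form(y, c):
    """Closed-form minimizer over x >= 0: clip y at zero, then shift the
    entry at an index of the largest y_i by c before clipping."""
    y = np.asarray(y, dtype=float)
    x = np.maximum(y, 0.0)
    i_star = int(np.argmax(y))
    x[i_star] = max(y[i_star] + c, 0.0)
    return x

def prox_brute_force(y, c):
    """Try every index as i_star and keep the candidate with the lowest objective."""
    y = np.asarray(y, dtype=float)
    best_x, best_val = None, np.inf
    for i in range(y.size):
        x = np.maximum(y, 0.0)
        x[i] = max(y[i] + c, 0.0)
        val = objective(x, y, c)
        if val < best_val:
            best_x, best_val = x, val
    return best_x

y_mixed = np.array([0.5, -0.2, 1.0])    # Case (c): mixed signs
y_neg = np.array([-0.2, -0.4])          # Case (b): all negative, some y_i + c > 0
print(prox_closed_form(y_mixed, 0.3))
print(prox_closed_form(y_neg, 0.5))
```

Because only one entry receives the shift by $c$, the update costs $O(n)$ per column, which is what makes the non-smooth formulation cheap inside a proximal scheme.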

Appendix F Proof of Theorem 5

By a similar argument as the proof of Theorem 4, we obtain that the sequence $({\bf W}^{k},{\bf H}^{k})$ generated by Algorithm 3 with $t^{k}>L_{\widetilde{F}}({\bf W}^{k})$ and $c^{k}>\frac{L_{F}({\bf H}^{k})}{2}$ is a bounded sequence. We assume that $({\bf W}^{\infty},{\bf H}^{\infty})$ is a limit point of $({\bf W}^{k},{\bf H}^{k})$ , and $t^{\infty}>L_{\widetilde{F}}({\bf W}^{\infty})$ is a limit value of $t^{k}$ .

At iteration $k$ , since $\widetilde{F}({\bf W}^{k},{\bf H})$ is $L_{\widetilde{F}}({\bf W}^{k})$ -smooth and $t^{k}>L_{\widetilde{F}}({\bf W}^{k})$ , by the descent lemma [29, Lemma 3.2], we have

[TABLE]

Since ${\bf H}^{k+1}$ is the optimal solution of (13), we have

[TABLE]

By (F), (76) further implies

[TABLE]

By taking the limits of both sides of (78) as $k\rightarrow\infty$ , we have

[TABLE]

which implies that

[TABLE]

The optimality condition of (80) yields

[TABLE]

Similarly, for the update of ${\bf W}$ , one can show that

[TABLE]

Combining (81) and (82), we obtain that $({\bf W}^{\infty},{\bf H}^{\infty})$ is a d-stationary point to problem (13). $\blacksquare$

Bibliography


[1] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications. Boca Raton, FL, USA: Chapman & Hall/CRC Press, 2013.
[2] S. Renaud-Deputter, T. Xiong, and S. Wang, “Combining collaborative filtering and clustering for implicit recommender system,” in Proc. IEEE AIAN, Barcelona, Spain, Mar. 25-28 2013, pp. 748–755.
[3] J. Das, P. Mukherjee, S. Majumder, and P. Gupta, “Clustering-based recommender system using principles of voting theory,” in Proc. IEEE IC 3I, Mysore, India, Nov. 27-29 2014, pp. 230–235.
[4] C.-H. Zheng, D.-S. Huang, L. Zhang, and X.-Z. Kong, “Tumor clustering using nonnegative matrix factorization with gene selection,” IEEE Transactions on Information Technology in Biomedicine, vol. 13, no. 4, pp. 599–607, Jul. 2009.
[5] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, “Metagenes and molecular pattern discovery using matrix factorization,” in Proc. Natl. Acad. Sci. USA, Mar. 23 2004, pp. 4164–4169.
[6] S. Wang, P. Wu, M. Zhou, T.-H. Chang, and S. Wu, “Cell subclass identification in single-cell RNA-sequencing data using orthogonal non-negative matrix factorization,” in Proc. IEEE ICASSP, Calgary, Canada, Apr. 15-20 2018, pp. 876–880.
[7] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Info. Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.
[8] H. Ding, Y. Liu, L. Huang, and J. Li, “K-means clustering with distributed dimensions,” in Proc. ICML, New York, USA, Jun. 19-24 2016, pp. 1339–1348.