TL;DR
This paper introduces a novel non-convex penalty approach for orthogonal NMF clustering that transforms orthogonality constraints into norm-based penalties, enabling scalable and efficient solutions that outperform existing methods.
Contribution
The paper proposes a new formulation of orthogonal NMF using non-convex penalties, along with efficient algorithms and theoretical conditions for feasible solutions.
Findings
The proposed NCP methods are computationally efficient.
They match or outperform existing clustering methods.
Experimental results validate the effectiveness of the approach.
Abstract
The non-negative matrix factorization (NMF) model with an additional orthogonality constraint on one of the factor matrices, called the orthogonal NMF (ONMF), has been found a promising clustering model and can outperform the classical K-means. However, solving the ONMF model is a challenging optimization problem because the coupling of the orthogonality and non-negativity constraints introduces a mixed combinatorial aspect into the problem due to the determination of the correct status of the variables (positive or zero). Most of the existing methods directly deal with the orthogonality constraint in its original form via various optimization techniques, but are not scalable for large-scale problems. In this paper, we propose a new ONMF based clustering formulation that equivalently transforms the orthogonality constraint into a set of norm-based non-convex equality constraints. We…
| SNR (dB) | -5 | -3 | -1 | 1 | 3 | |
| ACC | KM | 63.4 | 69.7 | 74.7 | 74.3 | 75.6 |
| KM++ | 64.1 | 70.1 | 70.3 | 70.8 | 69.2 | |
| DTPP | 81.6 | 85.1 | 85.9 | 87.0 | 86.6 | |
| ONP-MF | 66.8 | 88.3 | 83.1 | 89.9 | 90.0 | |
| ONMF-S | 77.4 | 79.0 | 80.2 | 81.3 | 81.7 | |
| HALS | 76.3 | 86.0 | 88.1 | 89.6 | 89.4 | |
| SNCP | 91.5 | 91.9 | 92.0 | 92.5 | 92.8 | |
| NSNCP | 90.1 | 90.7 | 91.8 | 92.0 | 92.2 | |
| Time | KM | 3.10 | 1.96 | 1.52 | 1.16 | 1.11 |
| KM++ | 3.09 | 2.44 | 2.05 | 1.70 | 1.59 | |
| DTPP | 30.4 | 34.0 | 38.6 | 31.1 | 35.1 | |
| ONP-MF | 1097 | 1123 | 1124 | 1153 | 1148 | |
| ONMF-S | 68.4 | 97.1 | 116 | 116 | 90.5 | |
| HALS | 22.6 | 25.6 | 23.3 | 15.7 | 19.3 | |
| SNCP | 20.6 | 15.6 | 13.4 | 13.3 | 12.4 | |
| NSNCP | 14.8 | 15.5 | 11.5 | 10.5 | 10.6 | |
| SNR (dB) | -5 | -3 | -1 | 1 | 3 | |
|---|---|---|---|---|---|---|
| #Iteration | DTPP | 2000 | 2000 | 2000 | 2000 | 2000 |
| ONMF-S | 2000 | 2000 | 2000 | 2000 | 2000 | |
| HALS | 1931 | 1788 | 1521 | 1442 | 1325 | |
| SNCP | 1921 | 1601 | 1401 | 1310 | 1260 | |
| NSNCP | 1478 | 1322 | 1130 | 1063 | 1047 | |
| Dataset | 1 | 2 | 3 | 4 | 5 | |
| #samples | 1667 | 3086 | 3660 | 5314 | 11135 | |
| #cancers | 5 | 10 | 15 | 20 | 33 | |
| ACC | KM | 75.0 | 67.0 | 57.0 | 52.2 | 34.4 |
| KM++ | 75.5 | 55.3 | 53.7 | 48.4 | 34.5 | |
| DTPP | 79.2 | 58.5 | 58.0 | 57.0 | 43.1 | |
| ONMF-S | 85.8 | 71.8 | 61.9 | 58.2 | 38.4 | |
| ONP-MF | 84.0 | 68.5 | 48.1 | 32.9 | 14.8 | |
| HALS | 86.2 | 68.8 | 56.1 | 56.3 | 39.1 | |
| SNCP | 85.6 | 79.3 | 64.0 | 61.1 | 41.1 | |
| NSNCP | 89.2 | 81.2 | 68.2 | 64.3 | 42.7 | |
| Time | KM | 3.02 | 7.54 | 12.3 | 34.7 | 115 |
| KM++ | 3.08 | 7.83 | 26.9 | 40.3 | 331 | |
| DTPP | 249 | 434 | 1357 | 2135 | 2454 | |
| ONMF-S | 89.9 | 772 | 1440 | 2492 | 19794 | |
| ONP-MF | 4886 | 10414 | 14232 | 16092 | 77222 | |
| HALS | 2.26 | 11.4 | 98.1 | 260 | 1605 | |
| SNCP | 41.6 | 249 | 307 | 486 | 1118 | |
| NSNCP | 22.1 | 153 | 240 | 526 | 1756 | |
| Dataset | 1 | 2 | 3 | 4 | 5 | 6 | |
| #terms | 13133 | 24968 | 11079 | 20431 | 16067 | 29725 | |
| #docs | 842 | 3292 | 631 | 1745 | 1079 | 4779 | |
| ACC | KM | 77.9 | 51.7 | 84.4 | 49.3 | 61.3 | 70.2 |
| KM++ | 73.5 | 48.9 | 84.7 | 47.1 | 61.7 | 65.7 | |
| DTPP | 70.0 | 50.4 | 77.4 | 52.1 | 45.1 | 66.6 | |
| ONMF-S | 81.8 | 52.0 | 83.6 | 59.6 | 50.9 | 68.8 | |
| ONP-MF | 85.3 | 46.1 | 89.5 | 60.9 | 59.4 | 66.8 | |
| HALS | 77.9 | 47.8 | 85.1 | 54.3 | 48.7 | 64.4 | |
| SNCP | 86.1 | 56.8 | 88.1 | 62.0 | 58.2 | 70.7 | |
| NSNCP | 79.7 | 54.1 | 88.6 | 60.8 | 56.4 | 64.8 | |
| Time | KM | 13.6 | 195 | 6.82 | 65.0 | 27.4 | 318 |
| KM++ | 10.6 | 199 | 8.15 | 53.6 | 28.9 | 401 | |
| DTPP | 126 | 726 | 81 | 413 | 204 | 2282 | |
| ONMF-S | 405 | 8124 | 157 | 1915 | 633 | 17603 | |
| ONP-MF | 1200 | 7854 | 757 | 3660 | 1951 | 14302 | |
| HALS | 12.9 | 35.6 | 6.07 | 27.4 | 20.0 | 162 | |
| SNCP | 37.3 | 407 | 20.4 | 64.0 | 41.1 | 342 | |
| NSNCP | 24.7 | 335 | 16.3 | 50.0 | 37.6 | 424 | |
| Dataset | 1 | 2 | 3 | 4 | 5 | 6 | |
| Time of SC | 13.8 | 402 | 7.11 | 90.7 | 30.7 | 1114 | |
| ACC | KM | 81.6 | 69.0 | 83.1 | 86.3 | 89.1 | 80.2 |
| KM++ | 98.3 | 83.3 | 98.9 | 99.6 | 98.9 | 90.3 | |
| HALS | 95.8 | 82.8 | 97.8 | 98.8 | 99.4 | 90.8 | |
| ONP-MF | 98.6 | 82.3 | 99.0 | 99.3 | 99.5 | 89.7 | |
| SNCP | 99.0 | 84.3 | 98.0 | 99.2 | 99.3 | 92.7 | |
| Time | KM | 0.24 | 1.34 | 0.17 | 0.45 | 0.33 | 2.33 |
| KM++ | 0.25 | 1.96 | 0.18 | 0.46 | 0.34 | 2.18 | |
| HALS | 0.78 | 2.45 | 0.81 | 1.65 | 0.75 | 1.52 | |
| ONP-MF | 13.6 | 385 | 7.29 | 90.0 | 26.0 | 790 | |
| SNCP | 0.59 | 2.00 | 0.55 | 1.31 | 1.21 | 3.17 | |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Clustering by Orthogonal NMF Model and Non-Convex Penalty Optimization
Shuai Wang, Tsung-Hui Chang, Ying Cui, and Jong-Shi Pang Shuai Wang and Tsung-Hui Chang are with the Shenzhen Research Institute of Big Data and School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China (e-mail: [email protected], [email protected]).Ying Cui is with the Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, MN 55455, USA (e-mail: [email protected]).Jong-Shi Pang is with the Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, CA 90089, USA (e-mail: [email protected]).
Abstract
The non-negative matrix factorization (NMF) model with an additional orthogonality constraint on one of the factor matrices, called the orthogonal NMF (ONMF), has been found a promising clustering model and can outperform the classical K-means. However, solving the ONMF model is a challenging optimization problem because the coupling of the orthogonality and non-negativity constraints introduces a mixed combinatorial aspect into the problem due to the determination of the correct status of the variables (positive or zero). Most of the existing methods directly deal with the orthogonality constraint in its original form via various optimization techniques, but are not scalable for large-scale problems. In this paper, we propose a new ONMF based clustering formulation that equivalently transforms the orthogonality constraint into a set of norm-based non-convex equality constraints. We then apply a non-convex penalty (NCP) approach to add them to the objective as penalty terms, leading to a problem that is efficiently solvable. One smooth penalty formulation and one non-smooth penalty formulation are respectively studied. We build theoretical conditions for the penalized problems to provide feasible stationary solutions to the ONMF based clustering problem, as well as proposing efficient algorithms for solving the penalized problems of the two NCP methods. Experimental results based on both synthetic and real datasets are presented to show that the proposed NCP methods are computationally time efficient, and either match or outperform the existing K-means and ONMF based methods in terms of the clustering performance.
Keywords Data clustering, orthogonal non-negative matrix factorization, penalty method.
I Introduction
Clustering is one of the most fundamental data mining tasks and has an enormous number of applications [1]. Typically, clustering is a key intermediate step to explore underlying structure of massive data for subsequent analysis. For example, in internet applications, clustering is used to identify users of different interests who then can be provided with specific service recommendation [2, 3]. In biology, clustering can be used for pattern discovery of genes which can help identify subtypes of a certain disease or cancer [4, 5, 6].
Among the existing clustering methods, the K-means [7] is the most widely used one, thanks to its simplicity [8]. However, the K-means may not always yield satisfactory clustering results. On one hand, from an optimization perspective, the iterative steps of finding the cluster centroids and cluster assignment in K-means are equivalent to solving a binary integer constrained matrix factorization problem by alternating optimization [9, 10]. Due to the non-convex matrix factorization model and binary integer constraint, the iterates of K-means are likely to be stuck at an unsatisfactory local point, and are sensitive to the choice of initial points [11]. On the other hand, the K-means overlooks the inherent low-rank structure and prior information which are usually owned by high-dimensional real data. Therefore, various dimension-reduction techniques such as principal component analysis (PCA), spectral clustering [12], non-negative matrix factorization (NMF) [13, 14] and deep neural networks [15, 16] are proposed. However, these methods are merely used as a preprocessing stage to find a clustering-friendly representation for the data, and the K-means is still often used for clustering the dimension-reduced data. Thus, the intrinsic drawback of the K-means caused by the non-convex nature and discrete constraints is not addressed.
Recently, as an variant of NMF, the orthogonal NMF (ONMF) model has been considered for data clustering [17, 18, 19, 20, 21, 22, 6]. The ONMF model imposes an additional orthogonality constraint on one of the factor matrices in NMF. It turns out that, like the K-means, the orthogonally constrained factor matrix functions the same as an indicator matrix that shows how the data samples are assigned to different clusters [17, 21]. Therefore, the ONMF model can be regarded as a continuous relaxation (which has no discrete constraint) of the K-means. Studies on various data mining tasks have found that the ONMF model can outperform the K-means and NMF based clustering methods [23, 24, 18, 20, 21, 22, 6].
I-A Related Works
Despite its widespread use, solving the ONMF problem is challenging due to the existence of both the orthogonality and non-negativity constraints. Many of the existing ONMF algorithms extend upon the classical multiplicative update (MU) rule by Lee and Seung [13] for the vanilla NMF to accommodate the additional orthogonality constraint. For example, reference [17] penalized the orthogonality constraint followed by applying the MU rule. The authors of [19] derived the MU rule directly using the gradient vector over the Stiefel manifold. Reference [21] employed the augmented Lagrangian (AL) method that penalizes the non-negativity constraint and applies the gradient projection method for the orthogonally constrained subproblem. Since projection onto the set of orthonormal matrices involves singular value decomposition (SVD), the orthogonal nonnegatively penalized matrix factorization (ONP-MF) method in [21] can be computationally inefficient. The hierarchical alternating least squares (HALS) method proposed in [20] is claimed to achieve a better balance between the orthogonality and non-negativity constraints. The HALS method updates one column and one row of the two respective factor matrices at the same time in each iteration, subject to the orthogonality and non-negativity constraints. Since the subproblem in HALS is handled by the same MU rule in [17], it is computationally efficient in general. Reference [22] proposed to approximate the ONMF solution by solving a low-rank non-negative PCA problem, which however involves generating a large number of candidate solutions. While the method in [22] is the first that can provide provable approximation guarantee for the ONMF problem, it can be computationally inefficient especially when a high-quality solution is sought.
On the other hand, it is noticed that some recent Riemannian optimization methods on the Stiefel manifold [25, 26] can handle problems with orthogonality constraints together with non-smooth, Lipschitz-continuous regularizations. However, since the indicator function induced by the non-negativity constraint is not Lipschitz continuous, these methods are not applicable to the ONMF problem.
I-B Contributions
In this paper, we propose a new approach to handle the ONMF based data clustering problem, aiming at achieving high-quality clustering performance with reasonable computational cost. Specifically, by recognizing the fact that the coupling of the orthogonality constraint and non-negativity constraint introduces disjunctions into the problem and projection onto the orthogonal set requires expensive SVD, we avoid directly dealing with the orthogonality constraint as done in the existing methods [17, 20, 19, 21]. Instead, we propose in [27] a novel problem reformulation for the ONMF problem that replaces the orthogonality constraint by a set of norm-based (non-convex) equality constraints. The second ingredient of the proposed approach is the use of the penalty method in optimization [28] to add these non-convex norm-equality constraints as penalty terms in the objective function while keeping the non-negativity in the constraints. The advantages of the proposed method are twofold. First, the penalty method allows to find a feasible solution to satisfy the norm-equality constraints gradually in a gentle fashion, and is less likely to be stuck at bad local solutions. Second, since only simple non-negativity constraints are left, the penalized problem can be efficiently handled by the existing proximal alternating linearized minimization (PALM) method [29] without involving any SVD. As will be shown shortly, the proposed algorithms are inherently parallel and are more computationally efficient than most of the existing methods in practice.
In particular, we consider two novel types of non-convex penalty (NCP) formulations - one is smooth that has squared -norm minus squared -norm as the penalty term, and the other one is non-smooth that has -norm minus -norm as the penalty term. The two penalties lead to different theoretical properties and algorithm developments. We obtain theoretical conditions under which the two penalty methods can yield a feasible and meaningful solution to the considered ONMF based clustering problem. We also develop computationally efficient PALM algorithms to solve the penalized problems for the smooth and non-smooth NCP methods, respectively. It is worthy to mention that the penalized problem of the non-smooth NCP method is a non-convex and non-smooth problem with a negative infinity norm regularization. We propose a novel non-convex proximal operator of the negative infinity-norm function, and show that it has a simple closed-form solution.
Extensive experiments are conducted based on a synthetic data set [10], the gene dataset TCGA [30, 31] and the document dataset TDT2 [32]. The experimental results demonstrate that the proposed smooth and non-smooth NCP methods can provide either comparable or greatly better performance than the existing K-means based methods and ONMF based methods in [17, 21, 19, 20], and at the same time are more time efficient than most of the existing ONMF based methods.
Synopsis: Section II reviews the existing K-means and ONMF model, and presents the proposed clustering problem formulation. The proposed NCP optimization framework is presented in Section III, where theoretical conditions for the smooth and non-smooth NCP methods to yield feasible stationary solutions are analyzed. The PALM algorithms used for solving the smooth and non-smooth penalized problems are given in Section IV. Experimental results are presented in Section V, and lastly the conclusions are drawn in Section VI.
Notation: We use boldface lowercase letters and boldface uppercase letters to represent column vectors and matrices, respectively. denotes the set of by real-valued matrices. The th entry of matrix is denoted by ; the th element of vector is denoted by or . Superscript stands for matrix transpose. For a matrix , its column vectors are denoted by , , and its row vectors are denoted by , ; that is, We denote as a non-negative matrix, i.e., for all . denotes the all-one vector, is the all-zero vector, is the by identity matrix, and denotes the elementary vector with one in the th entry and zero otherwise. and are the matrix Frobenius norm and vector -norm, respectively. denotes the inner product between matrices and . stands for the maximum eigenvalue of matrix . For a convex function , denotes the subdifferential of at as in standard convex analysis [33]. Lastly, denotes which is a matrix that reserves the non-negative elements of .
II Data Clustering and ONMF Model
II-A K-Means and ONMF Model
Let be a non-negative data matrix that contains data samples and each of the samples has features, i.e., . The task of data clustering is to assign the data samples into a predefined number of clusters in the sense that the samples belonging to one cluster are close to each other based on certain distance metric. The most popular setting is to consider the Euclidean distance and the use of the K-means due to its simplicity.
From an optimization point of view [9, 10], the iterative procedure of K-means can be interpreted as an alternating optimization algorithm applied to the following matrix factorization problem
[TABLE]
where and . Here, columns of represent centroids of the clusters, while the matrix indicates the cluster assignment of samples. Specifically, indicates that the th sample is uniquely assigned to cluster , and otherwise.
One can see from (1) that, when is given, the optimal is obtained by assigning each sample to the cluster that has the nearest centroid, while when is given, the optimal is given by the centroid (average of samples) assigned to the th cluster, for all . The two steps are exactly the well-known K-means algorithm. However, due to the non-convex binary constraint (1b) and the hard clustering assignment during the iterative steps of K-means, the K-means is sensitive to the initial conditions and may not always yield satisfactory clustering performance [11].
Regarded as a relaxation of (1), the following vanilla NMF model is also considered for data clustering [14, 21, 10]
[TABLE]
However, since (1b) is completely removed from (1), the obtained from the above vanilla NMF model cannot reveal clear clustering assignment.
It was shown in [17, 21] that the ONMF model, which has an additional orthogonality constraint on , is closely related to the K-means model (1). In particular, the ONMF problem is given by
[TABLE]
A key observation is given as below.
Observation: Any satisfying and has at most one non-zero entry in each column.
Thus, matrix in (3) functions similarly as that in (1) whose nonzero elements indicate the cluster assignment of data samples. Nevertheless, different from (1), the non-zero entries of in (3) are not restricted to be either zero or one but can be scaled. Owing to the two facts, the ONMF model is less sensitive to data scaling and is preferred than the K-means especially for data where clustering results are independent of data scaling; see [21] for more discussions.
However, the ONMF problem (3) is intrinsically challenging to solve since the above Observation indicates that the intersection of and is mixed combinatorial due to the determination of the correct status of the variables (positive or zero). In view of this, the philosophy of the existing methods is to handle the two constraints separately either by the penalty method [17, 20, 19] or by the AL method [21] which requires repetitive projections onto the orthogonal set via expensive SVD. Unlike the existing methods, we present a novel problem reformulation of (3) which avoids dealing with the orthogonality constraint directly. Based on the reformulated problem, we propose a non-convex penalty (NCP) optimization framework that is not only amenable to efficient computation but also able to provide favorable clustering performance.
II-B Proposed Clustering Formulation
We note that any vector has at most one non-zero entry if and only if
[TABLE]
According to the Observation, any satisfying (3c) and (3b) also lies in the following set
[TABLE]
Thus, the ONMF model (3) can be equivalently written as
[TABLE]
Firstly, note that the condition , is not intrinsic to the data clustering task. In essence, both and , where is a diagonal matrix, indicate the same cluster assignment, and both and have the same objective values in (8a). Therefore, without loss of the clustering performance, we remove , from (8). Secondly, for bounded solution, we add the regularization term to (8), where are two parameters111Adding the regularization term makes the objective coercive, which subsequently can guarantee bounded solutions to the optimization problem. . Thirdly, without loss of generality, we replace with by adding a power exponent . As a result, we have the following problem formulation for data clustering:
Proposed clustering formulation:
(9a)
(9b)
\displaystyle~{}~{}~{}~{}\begin{array}[]{ll}&{\bf H}\geq{\bm{0}},\\ &\|{\bf h}_{j}\|_{p}^{v}=\|{\bf h}_{j}\|_{q}^{v},~{}j\in{\mathcal{N}},\end{array}\bigg{\}}\triangleq\mathcal{H}_{p,q}^{v}.
(9e)
One clear advantage of formulation (9) is its scalability. Concisely, when is fixed, problem (9) can be fully decoupled across the columns of into subproblems. This makes it easy to apply some decomposition methods for dealing with large-scale clustering tasks. Nevertheless, it is still challenging to solve (9). The main challenge lies in two interwound issues. The first is how to deal with the non-convex objective function and the (possibly non-smooth) norm-equality constraint (9e) together with the non-negativity constraint (9b). The second is how to choose proper values of , and since different choices lead to different theoretical and computational properties.
Before proceeding with the algorithmic development, we first present basic definitions for characterizing a proper solution of the non-convex and possibly non-smooth problem (9).
Definition 1
(Tangent cone) Let and . A vector is a tangent of at if either or there exists a sequence and positive scalars such that when . The tangent cone of at , denoted by , contains all the tangents of at .
Definition 2
(Directional derivative) Let be a possibly non-smooth function. Then is directionally differentiable if the directional derivative of along any direction
[TABLE]
exists at any .
If is differentiable, then the directional derivative in the above definition reduce to , where is the gradient vector of at .
Definition 3
(B-stationary solution)[34] For an optimization problem , where is directionally differentiable, and is a closed set. Then, is a B-stationary point if
If is a convex set, then the condition is equivalent to and is known as the d-stationary point. If is differentiable and is convex, then the condition reduces to and is simply called a stationary point [34, (1.3.3)].
As proved in Appendix A, the following proposition shows that any B-stationary point of (9) has non-zero columns for .
Proposition 1
Let be a B-stationary point of (9) satisfying . Then, .
The condition of is actually mild as it is known that the centroid should usually lie in the space spanned by the cluster samples [35]. Proposition 1 implies that a B-stationary point of (9) is clustering meaningful since, at a B-stationary point, each data sample must be properly assigned to one of the clusters.
III Proposed Non-Convex Penalty Method
In this section, we present the proposed NCP framework to handle problem (9). Let for . Since is equivalent to making mixed binary decisions for each entry of (being zero or non-zero), our intuition is to have an algorithm that can gently arrive at the mixed binary decision so that it can be less sensitive to bad initial points. Also, it is desirable to handle a problem with simple constraint sets. Motivated by this, we consider the following penalized formulation
[TABLE]
where , and is a penalty parameter. As seen from (11), since the non-convex norm-equality constraints in (9e) are penalized in the objective function, the penalized problem (11) involves simple convex constraint set only.
Like the classical penalty method [28, Chapter 17], we attempt to reach a good clustering solution of (9) through solving a sequence of penalized subproblems in (11), by gradually increasing . The proposed NCP framework is shown in Algorithm 1. It is expected that the norm-equality constraint (9e) can be gradually satisfied as increases, implying that the clustering assignment is achieved step by step in a smooth manner. This is in contrast to the classical K-means which has hard decision on the clustering assignment in the iterative process. This property also makes the proposed NCP method less sensitive to the choice of the initial point, which is one of the key advantages. An illustrative example demonstrating this point is given in [36, Section 5] of the supplementary material.
We study two types of penalty functions. In particular, we respectively consider a smooth penalty and a non-smooth penalty for (11). The smooth penalty is inspired by the classical quadratic penalty method [28, Chapter 17.1] where we choose for (11). For the non-smooth penalty, we choose and . As will be shown in Section III-B, such choice can provide so called exact penalty property [28]222We say the penalized problem like (11) has exact penalty if there exists a finite value of such that for any the solution of (11) is also a (feasible) solution of (9). while only requiring relaxed conditions on the solution of (11). To justify the effectiveness of the two penalties, in the ensuing two subsections, we present theoretical conditions for which the penalty method and Algorithm 1 can yield a feasible stationary solution to problem (9). In Section IV, efficient algorithms for solving (11) are presented.
III-A Smooth NCP (SNCP) Method
The proposed SNCP is obtained by choosing and in (9) and (11). Since , , the corresponding penalized problem (11) is given by
[TABLE]
which has G_{\rho}({\bf W},{\bf H})\triangleq F({\bf W},{\bf H})+\frac{\rho}{2}\sum_{j=1}^{N}\big{(}({\bf 1}^{\top}{\bf h}_{j})^{2}-\|{\bf h}_{j}\|_{2}^{2}\big{)} as the smooth objective function.
The following proposition provides the theoretical justification for the SNCP method.
Theorem 1
There exists a finite such that for any , any local minimum solution of (12) is a feasible and a local minimum solution to (9) (with , and ).
The proof of Theorem 1 is given in Appendix B. Theorem 1 shows that the SNCP has the exact penalty as long as a local minimum of (12) can be reached. Unfortunately, this cannot be guaranteed in general. The following theorem asserts a relaxed condition that if only a stationary solution (not local minimum) is obtained, (12) can still yield a feasible B-stationary solution of (9) as long as goes to infinity.
Theorem 2
*Let denote a stationary point of (12), and assume that is bounded and it has a limit point when . Then, is a feasible B-stationary point to (9). *
In practice, a finite value of would be sufficient to obtain a reasonably good solution as one will see in Section V. It is worthwhile to note that for the general quadratic penalty method [28, Chapter 17.1] to claim the same result as in Theorem 2, one requires conditions such as satisfies certain constraint qualification. Our Theorem 2 does not have such requirement thanks to the special constraint structure of (9); see the proof of Theorem 2 in Appendix C.
III-B Non-smooth NCP (NSNCP) method
In this subsection, we consider a non-smooth NCP method by setting and in (9) and (11). The corresponding penalized problem (11) is given by
[TABLE]
where the objective function F_{\rho}({\bf W},{\bf H})\triangleq F({\bf W},{\bf H})+{\rho}\sum_{j=1}^{N}\big{(}{\bf 1}^{\top}{\bf h}_{j}-\|{\bf h}_{j}\|_{\infty}\big{)} is non-smooth. Interestingly, analogous to the exact penalty in Theorem 1 but without the need of a local minimum solution, the NSNCP method allows one to obtain a feasible B-stationary solution of (9) if a d-stationary solution of (13) can be obtained.
Theorem 3
*Assume that is bounded. Then, there exists a finite such that for all , if is a d-stationary point of (13), then is feasible and a B-stationary point to (9). *
The proof of Theorem 3 is presented in Appendix D. By comparing Theorem 3 with Theorem 1, one can see that the NSNCP method in (13) is theoretically preferred since it does not require the penalized problem (13) to provide a locally optimal solution (which computationally cannot be ascertained) but only a d-stationary point; by comparing Theorem 3 with Theorem 2, we see that the NSNCP requires a finite value of in the non-smooth formulation (12a) versus an infinite value of in the smooth formulation (13a). However, since (13) is non-convex and non-smooth, it is more challenging to handle than its smooth counterpart (12). In the next section, we present efficient algorithms that can be used to obtain a proper stationary solution of the penalized problems (12) and (13), respectively.
Remark 1
(On the choice of and ) As mentioned, the smooth penalty in (12) with is inspired by the classical quadratic penalty method [28, Chapter 17.1]. A natural question is – can we have other choices? In fact, one can verify that as long as satisfies
[TABLE]
then (11) will be a smooth problem. Moreover, one can have the same claim as Theorem 2 based on a straightforward extension of the proof of Theorem 2. However, we prefer the choice in (12) since we are not aware of any theoretical result in the literature that suggests one can benefit from higher orders of penalty functions. In addition, our experiences in numerical experiments in fact indicate that higher order choices may yield poor performance. As shown in the supplementary material [36, Section 3.1], a large value of would make the landscape of has a flat valley around the origin. Besides, as shown in [36, Section 3.2], large values of and make Algorithm 1 less stable and more difficult to reach a feasible solution satisfying .
Our choice of for the non-smooth penalty (13) is constructed on purpose in order to achieve the exact penalty property in Theorem 3. As one can see from (46) that a key property to prove Theorem 3 is that the directional derivative of has an upper bound linear in and independent of the exact values of . Such property would not hold for or other choices of . In Section IV-B, we will further show that (13) with the negative infinity norm can have a simple proximal operator which enable us to solve (13) efficiently. **
IV Obtaining a Stationary Point of (11)
In accordance with Theorem 2 and Theorem 3, we need to obtain a stationary solution for problem (12) and a d-stationary solution for problem (13), respectively. In view of the separable constraint structure for and , the PALM algorithm in [29] is particularly efficient in handling the two problems.
IV-A Algorithm for Solving (12)
By applying the PALM algorithm to the smooth penalized problems (12), one simply performs block-wise gradient projection with respect to and iteratively, as shown in Algorithm 2. Here, and are two step size parameters. Denote and as the Lipschitz constants of and , respectively. Convergence of Algorithm 2 can be established based on [29].
Theorem 4
Let be the sequence generated by Algorithm 2 with and . Then is non-increasing, is bounded, and converges to a stationary point of (12).
Proof: Since is a coercive function, and the PALM updates in Algorithm 2 guarantees descent of the objective function [29, Remark 4(iii)] under and , are bounded. Besides, because satisfies the Kurdyka-Lojasiewicw (KL) property, and , are convex sets, we can obtain the desired results by [29, Theorem 1].
IV-B Algorithm for Solving (13)
The non-smooth penalized problem (13) is more challenging to handle due to the non-smooth and non-convex term . In particular, when applying the same PALM strategy to problem (13), the corresponding subproblem for updating is given by
[TABLE]
where and is the smooth component in (13a). Problem (15) is a proximal operator associated with the non-smooth and concave function . Intriguingly, while being non-convex, the proximal operator (15) has a simple closed-form solution as stated below.
Proposition 2
Consider the following problem
[TABLE]
where is given and is a scalar. Denote as an optimal solution of (16), and let be the unique index such that . Then, and
[TABLE]
for .
Proposition 2 is proved in Appendix E. By applying Proposition 2 to (15), we obtain the PALM method for solving problem (13) in Algorithm 3.
We show in the following theorem that Algorithm 3 can yield a d-stationary solution of the penalized problem (13), as desired by Theorem 3. Denote and as the Lipschitz constant of and , respectively.
Theorem 5
Let be the sequence generated by Algorithm 3 with and . Then is non-increasing, is bounded, and converges to a d-stationary point of (13).
The proof is presented in Appendix F. By combining Theorem 3 and Theorem 5, we see that the NSNCP method with Algorithm 1 and Algorithm 3 can yield a feasible B-stationary solution of problem (9).
We should emphasize again that Proposition 2 is critical to satisfy the condition of Theorem 3 since it enables Algorithm 3 to output a d-stationary solution of problem (13) as stated in Theorem 5. On the contrary, if one uses the subgradient method for problem (13), it can only achieve a critical point defined based on the subdifferential of the non-smooth terms [29, Theorem 1], which is weaker than the d-stationary point due to the difference-of-convex (DC) objective function of problem (13); see [37] for detailed discussions.
Before ending the section, we have the following remarks.
Remark 2
(Comparison between SNCP and NSNCP) As presented in Section III, the NSNCP method has a stronger theoretical result since it can guarantee exact penalty without necessarily achieving a local minimum solution of the penalized problem (11). However, numerically we found the two methods in fact perform comparably to each other in terms of data clustering; see Section V for comparison details of these two versions of the penalty approach. Moreover, both methods can often yield more favorable results than the classical K-means, and outperform the existing ONMF methods in terms of both clustering performance and computation time. This will be discussed in Section V. **
Remark 3
(Complexity comparison with existing ONMF methods) As discussed in [20], the DTPP [17], ONMF-S [20] have the same per-iteration complexity of the order , and the HALS [20] has the per-iteration complexity of the order whereas ONP-MF [21] has a much higher per-iteration complexity which is no less than due to the SVD of a matrix. One can easily verify that the per-iteration complexity of both the proposed Algorithms 2 and 3 is , which is comparable to the complexity order of DTPP, ONMF-S and HALS since . In Section V, we will further demonstrate that the proposed SNCP/NSNCP methods have faster convergence speed than most of the existing ONMF methods, and therefore are computation-time more efficient.
V Experiment Results
In this section, we examine the clustering performance of the proposed SNCP and NSNCP methods against 6 existing clustering methods, namely, K-means (KM), K-means++ [11], DTPP [17], ONP-MF [21], ONMF-S [19] and HALS [20]. Note that the later four methods are all based on the ONMF model (3). The adjusted Rand index (ARI) [38] and clustering accuracy (ACC) [32] are adopted for performance evaluation, both of which are widely-used metrics for clustering validation. ACC is the ratio of the number of correctly clustered data samples to the total number of samples. A data sample is said to be correctly clustered if it is assigned to the same cluster as that in the ground truth. To define ARI, let be a set of data samples, be a clustering result obtained by an algorithm, and is the clustering result by the ground truth. The ARI of the clustering is defined as
[TABLE]
where is the number of samples that are in both cluster of and of , is the number of samples in cluster of , is the number of samples in cluster of . A larger value of ARI indicates a better clustering result. However, due to limited space, we only present the results of ACC here and relegate the results of ARI in [36, Section 4]. Both of these methods are implemented with python and have been uploaded to https://github.com/wshuai317/NCP\_ONMF.
V-A Performance with Synthetic Data
We follow the linear model in [10] to generate the synthetic data where denotes the measurement noise. The signal to noise ratio (SNR) is defined as dB. We follow the same procedure as in [10] to generate , and the cluster assignment matrix , with , and (10 clusters). The number of data samples in the 10 clusters are and , respectively. Like [10], of the data samples are replaced by randomly generated outliers.
Parameter setting: In Algorithm 1, the initial penalty parameter is set to . For Step 4 of Algorithm 1, the orthogonality of is measured by
[TABLE]
where is a diagonal matrix such that rows of have unit 2-norm. One updates in Step 4 of Algorithm 1 whenever . Besides, we define
[TABLE]
The stopping condition of Algorithm 1 is when both and are sufficiently small.
For Algorithm 2, and are chosen as and , respectively; while for Algorithm 3, and are chosen as and , respectively. The stopping condition for Algorithm 2 and Algorithm 3 is the normalized residual of which is defined in the same way as (19) and is denoted as .
If not mentioned specifically, we choose and in (9). The parameter for increasing is set to 1.1. When it comes to the stopping condition, if not mentioned specifically, for the SNCP method, we set for Algorithm 1 and for Algorithm 2; for the NSNCP method, we set for Algorithm 1 and for Algorithm 3. For the four ONMF methods under comparison, the stopping condition is or the maximum iteration number of 2000 is achieved. All algorithms under test are initialized with 10 common, randomly generated initial points, and the presented results are averaged over the 10 experimental trials.
Effect of and as well as and : The parameters and as well as and do have impact on the algorithm convergence. To justify our choices above, we conducted extensive experiments with different combinations of these parameters. The numerical results are presented in [36, Section 1] of the supplementary material. The messages that one can infer from these experiments are that 1) smaller values of and are preferred, and 2) a larger can speed up to satisfy the orthogonality of , but requires a smaller value of in order to achieve a lower objective value of (9). Interested readers may refer to the supplementary material [36] for more discussions.
SNCP vs NSNCP: Let us first examine the convergence behaviors of the proposed SNCP and NSNCP methods. Fig. 1 and Fig. 1 respectively display the normalized residual and orthogonality achieved versus the iteration number of Algorithm 1 when the SNCP and NSNCP methods are used. One can see from Fig. 1 that both SNCP and NSNCP converge and they can converge faster for smaller values of . As seen in Fig. 1, both methods indeed can achieve an orthogonal . Moreoever, it can be observed that the NSNCP usually converge faster, and it takes about 130 iterations to reach , which is faster than the SNCP method.
To further examine the difference between the SNCP and NSNCP methods, we plot in Fig. 2 the convergence curves of clustering accuracy (ACC) versus the iteration number of Algorithm 1 for both the SNCP and NSNCP methods, on the synthetic data with SNR = dB. The results are averaged over 10 experiments, each of which uses a different initial point for Algorithm 1. The error bars (dashed line) are the standard deviation of the 10 experimental results. Intriguingly, one can observe that the NSNCP method actually converges faster than the SNCP method in terms of clustering accuracy, which echos Theorem 3 that the NSNCP method requires a finite only to achieve a feasible and meaningful solution of (9). However, as seen from the figure, the SNCP method eventually can reach a slightly higher clustering accuracy than the NSNCP method on the synthetic data.
In Fig. 2(b), we further present the clustering accuracy (ACC) versus the stopping condition of these two methods. One can observe that the clustering accuracy improves with a smaller for both methods. We also see that the performance gap between the two methods reduces with smaller , and when , they yield comparable clustering performance. This implies that the NSNCP method may require a more stringent in order to reach a desired clustering performance.
Comparison with the existing ONMF methods: We compare the convergence speed of the proposed SNCP/NSNCP methods with the ONMF methods. Fig. 3(a) and Fig. 3(b) respectively show the average normalized residual achieved versus the (total) iteration number. For the SNCP/NSNCP, the iteration number refers to the accumulated iteration number of Algorithm 2 (resp. Algorithm 3) when the outer loop Algorithm 1 is applied. It can be observed from Fig. 3 that these traditional ONMF methods suffer from slow convergence, whereas the SNCP/NSNCP can converge faster although at the beginning their movements are not so fast. Table II further shows the average iteration numbers of all methods under test to reach the normalized residual . One can see that DTPP and ONMF-S both require more than 2000 iterations while the HALS requires less iteration numbers on average. The SNCP/NSNCP both require less iteration numbers than the other three ONMF methods, and the NSNCP has the least.
Clustering performance: Table I lists the average clustering performance (ACC) of the eight methods under test on the synthetic data with different SNR values. All results are obtained by averaging over 20 experiments. In each experiment, all methods use the same randomly generated initial point.
First of all, one can observe that K-means++ does not perform better than the K-means for the synthetic dataset. The ONMF based methods (i.e., DTPP, ONP-MF, ONMF-S, HALS, and proposed SNCP/NSNCP) significantly outperform the K-means and K-means++. Nevertheless, one can see from Table I that the proposed SNCP and NSNCP consistently yield the best clustering performance.
Computational time: Table I also shows the CPU time taken by each method. The computer used in the experiments has a Ubuntu 16.04 OS, and equipped with 3.40 GHz Intel Core i7-6700 CPU and 52 GB RAM. As seen, other than the K-means and K-means++ which are well known computationally cheap, the proposed SNCP and NSNCP methods are more computationally time efficient than the other five ONMF based methods. In particular, while the ONP-MF can provide competitive clustering performance when SNR dB, its computation time is long. By contrast, the HALS can provide reasonably good clustering performance when SNR dB and its computation time is moderate. Lastly, we note that the NSNCP method is slightly faster than the SNCP method, though the latter can provide the best clustering performance. All these results are consistent with the discussion in Remark 3 and the results in Table II.
Clustering stability: We evaluate the stability of the clustering methods against different initial points. In particular, we adopt the consensus map and cophenetic correlation (CC) coefficient [5] to measure the stability. The consensus map is based on the consensus matrix , which has each entry if sample and sample is assigned to the same cluster, and entry otherwise. Then the consensus map, denoted by , is the heatmap of the average of consensus matrices obtained by 10 runs of experiments with different initial points. Thus, roughly speaking, in the consensus map , the th entry will be close to 1 if sample and sample are consistently assigned to the same cluster even under different initial conditions. The CC coefficient (between 0 and 1) is a qualitative measure of the consensus matrix and it is defined as the Pearson correlation of two distance matrices of data samples: one is given by using the off-diagonal entries of and the other one utilizes the cophenetic distances of samples after performing average linkage hierarchical clustering; details can be referred to [39]. The CC coefficient would approach 1 if the consensus map have entries closer to [math] or .
Fig. 4 shows the results for SNR = dB, and one can see that the proposed SNCP method gives the stablest clustering results, which is followed by the NSNCP method. In particular, clustering inconsistency only happens in small-sized clusters for these two methods.
V-B Application to Biological Data Analysis
In the experiment, we apply the various clustering methods to the The Cancer Genome Atlas (TCGA) database [30, 31], which contains the expression data of 20531 genes on 11135 cancer samples belonging to 33 cancer types. We consider 5 subsets of the TCGA dataset with different numbers of cancer types as shown in the 3nd row of Table III. The Pearson’s Chi-Squared Test [40] is applied to select 5000 genes () for each data sample.
For the proposed SNCP method and NSNCP method, an additional constraint is added to upper bound each entry of by the maximum value of the data. The other parameters are set the same as that in Section V-A.
Performance comparison: One can see from Table III that the proposed NSNCP method performs best on the TCGA data, especially for the first three datasets. The SNCP method provides very close performances as the NSNCP method on datasets 2, 3 and 4. Compared with the ONMF based methods, the K-means and K-means++ yield relatively poor performance on this TCGA data. In terms of computation time, the HALS is still the most time efficient one, especially when the sample size is small. However, there exists considerable performance gap between HALS and the proposed NCP methods and the computation advantage of HALS can diminish when the data size increases. Lastly, one can observe from the table that the other three ONMF methods are quite computationally expensive, particularly the ONP-MF and ONMF-S.
V-C Application to Document Clustering
We here examine the performance of the proposed methods on the document dataset TDT2 corpus [32] which consists of 10212 on-topic documents in total with 56 semantic categories. We extract 6 subsets, each of which contains 10 randomly picked categories (). Each document sample is normalized and represented as a term-frequency-inverse-document-frequency (tf-idf) vector [32]. The dimension of each test data is shown in the first three rows of Table IV. In the experiment, we set and for the proposed SNCP and NSNCP methods.
Performance comparison: As seen from Table IV, except for Datasets 5 and 6, the ONMF based methods yield better performance than the K-means/K-means++. In particular, K-means++ yields the highest clustering accuracy on Dataset 5, and K-means gives comparable performance on Dataset 6. A close inspection shows that for Dataset 5, the initial points picked by K-means++ happen to be close to the cluster centroids of the ground truth.
In addition, we can see that, except for Datasets 3 and 5, the SNCP method provides higher accuracy than the ONMF-S/ONP-MF. Besides, as seem from Table IV, the computation time of the ONMF-S/ONP-MF are large and can be 50 times slower than the SNCP method on Dataset 6. It is also seen that the HALS is most time efficient on this experiment although it only provides moderate clustering performance. While the NSNCP method does not perform as well as its smooth counterpart, it gives comparable clustering performance for the first four dataset, and is computationally time efficient compared to ONMF-S and ONP-MF.
Clustering on dimension-reduced data:
Since the feature size of the TDT2 data is large and in view of the fact that dimension reduction techniques can extract low-rank structure of data and improve the clustering performance, we apply the spectral clustering (SC) [12] to the TDT2 data followed by applying the various clustering methods to the dimension-reduced data. In the experiment, we set and each entry of is lower (resp. upper) bounded by the minimum (resp. maximum) value of the dimension-reduced data. The experimental results are displayed in Table V. As observed, all methods have a great leap in the clustering performance when compared to Table IV, and there is no significant performance gap between various methods. Nevertheless, the proposed SNCP method can provide competitive clustering results and performs best on Dataset 1, 2, and 6. Although the ONP-MF performs best on Dataset 3 and 5, it remains to be expensive in computation time. While the K-means++ gives the best performance on Dataset 4, one can see that the proposed SNCP method is only slighter worse.
To look into the reason why SC can greatly improve the clustering performance, we employ t-SNE [41] 333Although the performance of t-SNE may vary with different initial points and hyperparameters, we have tested different initial points in the experiments and found most of the results are consistent to each other. and visualize some of the TDT2 datasets in Fig. 5. Here, TDT2_2 denotes Dataset 2 in Table IV while TDT2_DR2 denotes dimension-reduced Dataset 2 by SC. We can see from Fig. 5 that, although all of Dataset 2, 4 and 6 exhibit much clearer cluster structures after dimension reduction by SC, data samples in both TDT2_DR2 and TDT2_DR6 still overlap with each other which is challenging for K-means and K-means++. This may explain to some extent why all methods in Table V have significantly improved clustering performance comparing to Table IV, and why the K-means++ does not perform as well as the SNCP on TDT2_DR2 and TDT2_DR6.
In particular, we see that data samples in TDT2_DR4 in Fig. 5(e) becomes almost separable, and therefore the K-means++ can achieve a nearly perfect performance and outperforms the SNCP method. By contrast, as seen in Fig. 5(d) and 5(f), data samples in both TDT2_DR2 and TDT2_DR6 still overlap with each other after dimension reduction, and their data samples are unevenly spread with complex shapes. Such case is known to be challenging for K-means and K-means++ [16]. As seen from Table V, the SNCP method, which outperforms the K-means++ on the two datasets, is more capable of capturing complicated cluster structures.
VI Conclusion
In this paper, we have proposed the ONMF based data clustering formulation (9) and the NCP approach (11) and Algorithm 1. We have considered one smooth NCP formulation (12) and one non-smooth NCP formulation (13), and analyzed the theoretical conditions of the two methods for which a feasible and meaningful solution of (9) can be obtained. Efficient implementations of the proposed methods based on the PALM algorithm (Algorithm 2 and Algorithm 3) have also been devised.
Extensive experiments have been conducted based on the synthetic data and real data TCGA and TDT2. In particular, when comparing to the existing K-means and ONMF based methods, the proposed methods can perform either significantly better or comparably in terms of clustering accuracy, while being much more time efficient than most of the other ONMF based methods.
It is worthwhile to mention some interesting directions for future research. Firstly, while the current PALM algorithms seem to work well, it is possible to employ some more advanced optimization algorithms such as Nesterov’s accelerated method [42, 43, 44] to improve algorithm convergence speed. Secondly, while parallel and distributed implementations of the proposed methods are feasible, it is important to reduce the communication overhead between cluster nodes for large-scale scenarios [45, 46]. It is also interesting to study joint dimension reduction and clustering methods [10] using the proposed NCP methods, and extend the current Euclidean distance measure to more general cost functions such as the -divergence [47, 48] or the cohesion measure in [49].
Appendix A Proof of Proposition 1
Proof: Since is a B-stationary point of problem (7) in the manuscript, we have
[TABLE]
Suppose that there exist such that . Then, (20) becomes
[TABLE]
Suppose that for some . Choose a direction such that , for some . It is obvious to see that based on Definition 1. By substituting into (A), we obtain
[TABLE]
which, however, is a contradiction.
Appendix B Proof of Theorem 1
As is a local minimizer of problem (12), there exists a neighborhood of , denoted by , such that
[TABLE]
Suppose there exists a such that . Let ,
[TABLE]
and define a hyperplane . We will show a contradiction that there exists a neighbor of that lies in the hyperplane and achieves a smaller objective than .
To the end, for , let
[TABLE]
which is obtained by replacing the th column of by
[TABLE]
where and One can see that since
[TABLE]
Firstly, we show that is a convex combination of . By (25), we have . Thus, we can obtain
[TABLE]
where . Rearranging terms in (27) gives rise to
[TABLE]
Notice that . So (28) reduces to
[TABLE]
which implies that is a convex combination of .
Secondly, we show that is strongly concave with respect to on the hyperplane as long as is sufficiently large. It is sufficient to show that is strongly concave in . This is true since is Lipschitz continuous with a bounded Lipschitz constant (denoted by ), and thus will be a strongly concave function as long as .
By the above two facts, we obtain that
[TABLE]
where Since by (25), will lie in for , (30) shows a contradiction to the local optimality of . Thus, for , must be feasible and a local minimizer to (9).
Appendix C Proof of Theorem 2
To show that is a B-stationary point to (9), it suffices to show
[TABLE]
where . Since is a stationary point of (12), we have
[TABLE]
which directly implies (31a) by taking . To prove (31b), we employ the following proposition and applies it to (9) for each column of .
Proposition 3
Consider the following problem
[TABLE]
where is smooth, and the corresponding penalized problem
[TABLE]
where is a penalty parameter. Let be a stationary point of (34). Assume that is bounded and as . Then, is a feasible B-stationary point of (33).
Proof: Since is a stationary point of (34), which is also a Karush-Kuhn-Tucker (KKT) solution to (34). By the complementary slackness, satisfies
[TABLE]
Let be the limit of a subsequence with . We then have
[TABLE]
Since is bounded due to bounded , it follows from (36) that as . Hence, is feasible to (33), and all components of are zero except one component, say .
We next show that
[TABLE]
i.e., is a B-stationary point of (33). Before that, we claim that the tangent cone . Indeed, any must satisfy for all and . This yields for all ; thus the tangent cone is contained in the right-hand cone. The reverse inclusion is obvious because for all sufficiently small . Thus, (37) boils down to
[TABLE]
We next prove that there cannot exist an index and a subsequence such that for all . If not, we may assume without loss of generality that this subsequence is the entire sequence . Thus for all . This is in addition to for all sufficiently large . By the complementary slackness, we have
[TABLE]
Since is bounded, and and as , we arrive at a contraction after taking the limit on both sides of inequality (40). Hence, for any converging subsequence , there are only finite terms that have at least one more nonzero component in addition to .
By restricting to a further subsequence of satisfying and for all , we have
[TABLE]
which implies (38), and the proof is complete.
Appendix D Proof of Theorem 3
Analogous to the proof of Theorem 2, it is sufficient to consider the following proposition.
Proposition 4
Consider the following problem
[TABLE]
where is smooth, and the corresponding penalized problem
[TABLE]
where is a penalty parameter. Let be a d-stationary point of (43). Assume that is bounded. Then there exists a finite such that for all , is a feasible B-stationary point to problem (42).
Proof: Denote . Since is a d-stationary point of (43), we have
[TABLE]
where . Because
[TABLE]
and , where , condition (44) implies
[TABLE]
On the other hand, for bounded , we can upper bound
[TABLE]
By substituting (47) into (46), we obtain
[TABLE]
We argue that for . Suppose not. Let us choose a tangent as follows
[TABLE]
Substituting (50) into (49) gives rise to
[TABLE]
which is a contradiction since . Thus must hold, i.e., is feasible to (42) if .
Next, we show that is a B-stationary point to (42). Since is feasible to in (42) for . Thus, the tangent cone of at is given by
[TABLE]
Besides, one can verify that for all ,
[TABLE]
Therefore, by applying (56) and (57) to (44), we obtain
[TABLE]
which is the desired result.
Appendix E Proof of Proposition 2
Denote . Since is an optimal solution of (16), we have
[TABLE]
where is the directional derivative of at direction , and
[TABLE]
As the directional derivative of is , we have from (59) that
[TABLE]
Let be an index such that . Then, , and (61) implies
[TABLE]
Now let us consider the case that . Then, by (60), it holds that
[TABLE]
Substituting (63) into (62) gives rise to
[TABLE]
which implies On the other hand, suppose . Then,
[TABLE]
Substituting (65) into (62) gives rise to
[TABLE]
which implies By combining the above two implications, we obtain . In addition, by following a similar argument as in (63) to (66), one can obtain for all . Therefore, (17) is true and is recapitulated here
[TABLE]
for . If one substitutes the optimal in (67) into the objective of (16), one can easily verify that the index must be unique.
Next, let us show that
[TABLE]
We respectively consider three cases.
Case (a): : For , and by (67), the optimal objective value of (16) is given by
[TABLE]
Obviously, we must have .
Case (b): : If and , we have the trivial solution according to (67) and the definition .
Suppose that and the set is non-empty. We claim that . Suppose . Then we have and it leads to
[TABLE]
Suppose. Then by (67), we have
[TABLE]
From (70) and (E), It is apparent that , and thereby .
Further consider two possible solutions, namely, and with and , respectively, where . Then, by (E), we have
[TABLE]
whenever since for . In other words, with larger leads to a smaller objective value, and thus we conclude that .
Case (c): and . There are three possible situations: 1) ; 2) ; and 3) .
[TABLE]
Clearly, .
- If . Then, by (67), we have
[TABLE]
Based on a similar argument as in (E), one can conclude .
From (73) and (74), it is obvious to that for . Besides, consider two possible solutions, namely, with where , and with where . Then, by (73) and (74), we have
[TABLE]
where the last inequality holds since and . Thus, we obtain .
By summarizing the results from Case (a) to (c), we conclude that (68) holds true. Proposition 2 is proved.
Appendix F Proof of Theorem 5
By a similar argument as the proof of Theorem 4, we obtain that the sequence generated by Algorithm 3 with and is a bounded sequence. We assume that is a limit point of , and is a limit value of .
At iteration , since is -smooth and , by the descent lemma [29, Lemma 3.2], we have
[TABLE]
Since is the optimal solution of (13), we have
[TABLE]
[TABLE]
By taking the limits of both sides of (78) as , we have
[TABLE]
which implies that
[TABLE]
The optimality condition of (80) yields
[TABLE]
Similarly, for the update of , one can show that
[TABLE]
Combing (81) and (82), we obtain that is a d-stationary point to problem (13).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications . Boca Raton, FI, USA: Chapman & Hall/CRC Press, 2013.
- 2[2] S. Renaud-Deputter, T. Xiong, and S. Wang, “Combining collaborative filtering and clustering for implicit recommender system,” in Proc. IEEE AIAN , Barcelona, Spain, Mar. 25-28 2013, pp. 748–755.
- 3[3] J. Das, P. Mukherjee, S. Majumder, and P. Gupta, “Clustering-based recommender system using principles of voting theory,” in Proc. IEEE IC 3I , Mysore, India, Nov. 27-29 2014, pp. 230–235.
- 4[4] C.-H. Zheng, D.-S. Huang, L. Zhang, and X.-Z. Kong, “Tumor clustering using nonnegative matrix factorization with gene selection,” IEEE. Transactions on Information Technology in Biomedicine , vol. 13, no. 4, pp. 599–607, Jul. 2009.
- 5[5] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, “Metagenes and molecular pattern discovery using matrix factorization,” in Proc. Natl. Acad. Sci. USA , Mar. 23 2004, pp. 4164–4169.
- 6[6] S. Wang, P. Wu, M. Zhou, T.-H. Chang, and S. Wu, “Cell subclass identification in single-cell RNA-sequencing data using orthogonal non-negative matrix factorization,” in Proc. IEEE ICASSP , Calgary, Canada, Apr. 15-20 2018, pp. 876–880.
- 7[7] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Info. Theory , vol. 28, no. 2, pp. 129–137, Mar. 1982.
- 8[8] H. Ding, Y. Liu, L. Huang, and J. Li, “K-means clustering with distributed dimensions,” in Proc. ICML , New York, USA, Jun. 19-24 2016, pp. 1339–1348.
