A general model for plane-based clustering with loss function
Zhen Wang, Yuan-Hai Shao, Lan Bai, Chun-Na Li, and Li-Ming Liu

TL;DR
This paper introduces a comprehensive model for plane-based clustering that unifies existing methods and proposes a new loss function, with theoretical guarantees and experimental validation on artificial and real datasets.
Contribution
The paper presents a general framework encompassing various plane-based clustering methods and introduces a novel loss function for improved data distribution capture.
Findings
The model terminates in finite steps at a local or weak local optimum.
The new loss function effectively captures data distribution.
Experimental results verify the method's effectiveness.
| Group | kPC | PPC | TWSVC | RTWSVC | FRTWSVC | RampTWSVC | RFDPC |
|---|---|---|---|---|---|---|---|
| | AC(%)/MI(%) | AC(%)/MI(%) | AC(%)/MI(%) | AC(%)/MI(%) | AC(%)/MI(%) | AC(%)/MI(%) | AC(%)/MI(%) |
| G1 | 60.74/27.18 | 62.20/29.63 | 56.57/27.55 | 72.82/33.97 | | | |
| G2 | 87.37/67.10 | 87.39/67.34 | 62.79/28.42 | 87.37/67.10 | 87.37/67.10 | 84.38/56.71 | |
| G3 | 56.42/13.13 | 61.14/40.67 | 58.96/21.39 | 87.34/67.21 | 87.34/67.21 | 65.09/23.83 | |
| G4 | 58.87/24.09 | 58.81/11.19 | 61.53/30.08 | 57.06/16.79 | 57.18/21.07 | 73.45/48.10 | |

| Data | kmeans | kPC | PPC | TWSVC | RTWSVC | FRTWSVC | RampTWSVC | RFDPC |
|---|---|---|---|---|---|---|---|---|
| k:mn | AC(%) | AC(%) | AC(%) | AC(%) | AC(%) | AC(%) | AC(%) | AC(%) |
| | MI(%) | MI(%) | MI(%) | MI(%) | MI(%) | MI(%) | MI(%) | MI(%) |
| Compound | 86.29±4.05 | 72.04 | 73.90 | 75.97 | 80.44 | 82.52 | 77.07 | |
| 6:3992 | 7.14 | 34.71 | 38.85 | 50.44 | 56.63 | 48.18 | 48.37 | 67.38 |
| Dermatology | 69.76±0.77 | 60.50 | 70.36 | 71.93 | 60.50 | 60.50 | 72.67 | |
| 6:36634 | 11.47±2.15 | 29.65 | 3.48 | 10.17 | 28.95 | 28.59 | 24.42 | |
| Ecoli | 82.19±2.68 | 33.11 | 66.46 | 85.74 | 34.33 | 34.33 | 79.42 | |
| 8:3367 | 56.84±4.42 | 8.61 | 9.65 | 58.45 | 10.42 | 10.42 | 43.35 | |
| Glass | 65.58±3.22 | 55.73 | 66.62 | 57.59 | 57.40 | 62.77 | 62.37 | |
| 6:2149 | 2.23 | 22.55 | 8.54 | 17.83 | 17.69 | 18.20 | 20.95 | 12.61 |
| Iris | 84.57±6.86 | 67.54 | 60.95 | 91.24 | 92.67 | 94.95 | 86.79 | |
| 3:1504 | 70.47±9.10 | 25.41 | 12.04 | 82.53 | 82.31 | 86.97 | 71.71 | |
| Pathbased | 74.85±0.09 | 66.49 | 74.57 | 73.94 | 76.30 | 76.30 | 65.73 | |
| 3:3002 | 51.46±0.16 | 30.17 | 50.92 | 47.90 | 54.63 | 54.63 | 28.21 | |
| Zoo | 87.49±1.96 | 54.12 | 84.06 | 88.83 | 54.12 | 54.12 | 90.22 | |
| 7:10116 | 71.93±3.15 | 34.23 | 55.56 | 72.93 | 32.15 | 32.15 | 71.79 | |
| Aggregation | 91.91±0.69 | 79.19 | 79.00 | 88.49 | 82.82 | 84.10 | 80.71 | |
| 7:7882 | 81.24±0.75 | 48.84 | 48.43 | 63.52 | 64.23 | 60.70 | 52.36 | |
| R15 | 0.67 | 92.15 | 92.00 | 93.76 | 93.07 | 92.91 | 81.76 | 96.92 |
| 15:6002 | 2.40 | 64.86 | 59.28 | 73.64 | 73.49 | 67.33 | 47.37 | 86.80 |
| Vehicle | 63.24±2.21 | 62.03 | 62.77 | 51.00 | 65.23 | 65.00 | 58.59 | |
| 4:84618 | 17.73±0.82 | 3.25 | 1.28 | 9.02 | 12.63 | 12.07 | 14.84 | |
| Vowel | 0.61 | 82.93 | 84.10 | 83.28 | 84.23 | 83.60 | 80.94 | 84.18 |
| 11:52810 | 1.94 | 11.39 | 10.60 | 11.57 | 24.28 | 7.73 | 25.93 | 34.20 |
| Echocardiogram | 66.41±7.92 | 52.81 | 56.66 | 56.10 | 75.01 | 75.01 | 71.84 | |
| 2:13110 | 24.79±17.27 | 0.54 | 2.99 | 1.35 | 39.64 | 39.64 | 35.46 | |
| Haberman | 49.91±0.02 | 49.84 | 60.95 | 61.89 | 61.57 | 62.21 | 60.95 | |
| 2:3063 | 0.04±0.04 | 0.07 | 0.74 | 2.28 | 4.44 | 0.74 | 8.70 | |
| Heartc | 51.04±0.00 | 50.12 | 50.23 | 50.67 | 59.21 | 59.21 | 50.75 | |
| 2:30314 | 1.39±0.00 | 0.05 | 0.14 | 0.90 | 13.98 | 13.98 | 1.14 | |
| Heartstatlog | 51.45±0.07 | 50.04 | 50.35 | 50.81 | 51.40 | 51.40 | 51.82 | |
| 2:27013 | 1.87±0.07 | 0.20 | 0.15 | 0.63 | 1.63 | 1.67 | 2.40 | |
| Hepatitis | 62.77±3.03 | 55.56 | 71.90 | 66.27 | 67.02 | 67.02 | 69.38 | |
| 2:15519 | 0.29±0.13 | 0.96 | 14.93 | 0.17 | 7.18 | 1.95 | 6.09 | |
| Hourse | 50.15±0.00 | 51.34 | 50.15 | 51.34 | 51.34 | 52.12 | 51.98 | |
| 2:30026 | 1.24±0.00 | 0.55 | 1.24 | 0.55 | 0.55 | 0.46 | 0.25 | |
| Housevotes | 78.83±0.15 | 63.77 | 68.77 | 75.83 | 71.40 | 71.40 | 79.61 | |
| 2:43516 | 48.07±0.38 | 34.16 | 27.27 | 44.66 | 39.36 | 39.36 | 50.15 | |
| Sonar | 50.22±0.18 | 49.80 | 49.99 | 50.43 | 51.26 | 50.06 | | |
| 2:20860 | 0.74±0.28 | 0.01 | 0.23 | 0.64 | 2.06 | 0.67 | 2.72 | |
| Spect | 52.97±0.00 | 65.86 | 50.67 | 65.86 | 50.88 | 50.58 | 67.17 | |
| 2:26744 | 0.00 | 0.51 | 0.51 | 0.51 | 0.35 | 0.34 | 1.15 | 1.17 |
| Spectf | 53.95±2.31 | 49.49 | 50.51 | 51.93 | 41.93 | 51.39 | 53.20 | |
| 2:8844 | 6.43 | 0.19 | 1.67 | 6.34 | 4.00 | 3.03 | 6.51 | 15.15 |
| Pimaindian | 55.07±0.00 | 51.74 | 54.50 | 57.99 | 53.97 | 54.82 | 55.07 | |
| 2:7688 | 2.67±0.00 | 0.23 | 0.09 | 5.64 | 0.21 | 0.58 | 0.95 | |
| Tictactoe | 50.90±0.47 | 50.08 | 96.71 | 55.26 | 59.84 | 59.84 | 55.82 | |
| 2:95827 | 0.69±0.14 | 0.69 | 87.89 | 0.97 | 13.48 | 13.48 | 1.95 | |
| AC-win | 2 | 0 | 2 | 0 | 0 | 1 | 1 | 18 |
| MI-win | 6 | 0 | 1 | 0 | 1 | 1 | 2 | 12 |
| Both-win | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 12 |

| Data | kmeans | kPC | PPC | TWSVC | RTWSVC | FRTWSVC | RampTWSVC | RFDPC |
|---|---|---|---|---|---|---|---|---|
| k:mn | AC(%) | AC(%) | AC(%) | AC(%) | AC(%) | AC(%) | AC(%) | AC(%) |
| | MI(%) | MI(%) | MI(%) | MI(%) | MI(%) | MI(%) | MI(%) | MI(%) |
| Compound | 84.84±4.11 | 90.16 | 70.32 | 90.25 | 90.16 | 90.16 | 89.73 | |
| 6:3992 | 70.65±6.84 | 16.97 | 71.60 | 72.24 | 72.24 | 64.07 | 61.57 | |
| Dermatology | 71.66±1.26 | 72.66 | 70.62 | 72.60 | 72.60 | 72.60 | 72.90 | |
| 6:36634 | 17.84±3.67 | 18.00 | 3.65 | 18.00 | 18.00 | 18.00 | 20.16 | |
| Ecoli | 79.93±1.24 | 82.49 | 69.13 | 88.29 | 82.49 | 82.68 | 83.01 | |
| 8:3367 | 49.31±2.28 | 57.79 | 16.46 | 57.79 | 57.57 | 49.97 | 50.97 | |
| Glass | 69.27±1.45 | 69.04 | 66.82 | 70.10 | 69.04 | 69.04 | 70.77 | |
| 6:2149 | 37.50±2.09 | 7.35 | 23.42 | 41.42 | 41.42 | 29.18 | 30.81 | |
| Iris | 87.63±8.09 | 91.24 | 59.47 | 91.24 | 91.24 | 91.24 | 94.95 | |
| 3:1504 | 76.26±9.85 | 79.15 | 13.93 | 79.15 | 79.15 | 79.15 | 85.59 | |
| Pathbased | 0.18 | 76.29 | 59.94 | 80.57 | 76.29 | 76.29 | 91.92 | 93.92 |
| 3:3002 | 0.40 | 57.51 | 11.60 | 65.87 | 57.51 | 57.51 | 79.83 | 82.28 |
| Zoo | 87.14±3.39 | 90.63 | 89.52 | 90.63 | 90.63 | 90.63 | 91.15 | |
| 7:10116 | 70.79±5.39 | 77.99 | 72.90 | 77.99 | 77.99 | 77.99 | 70.27 | |
| Echocardiogram | 71.14±0.82 | 55.04 | 56.66 | 56.66 | 55.04 | 55.04 | 71.84 | |
| 2:13110 | 32.41±0.53 | 0.85 | 2.73 | 2.73 | 0.85 | 0.85 | 28.53 | |
| Haberman | 60.61±0.30 | 63.21 | 61.26 | 63.21 | 63.21 | 63.21 | | |
| 2:3063 | 0.23±0.13 | 4.97 | 0.75 | 4.97 | 4.97 | 4.97 | 5.41 | |
| Heartc | 50.76±0.09 | 51.37 | 51.26 | 50.50 | 51.37 | 51.37 | 52.44 | |
| 2:30314 | 1.74±0.31 | 2.19 | 1.68 | 0.62 | 2.19 | 2.19 | 3.52 | |
| Heartstatlog | 50.83±0.41 | 53.00 | 51.54 | 50.92 | 53.00 | 53.00 | 54.22 | |
| 2:27013 | 1.88±0.54 | 3.79 | 1.64 | 0.81 | 3.79 | 3.79 | 6.25 | |
| Hepatitis | 65.35±1.58 | 66.27 | 67.79 | 67.79 | 66.27 | 66.27 | 67.02 | |
| 2:15519 | 1.04±0.68 | 0.29 | 2.01 | 2.01 | 0.29 | 0.29 | 0.29 | |
| Hourse | 52.27±0.25 | 52.12 | 53.05 | 51.71 | 52.12 | 52.12 | 52.12 | |
| 2:30026 | 0.68±0.57 | 0.46 | 2.30 | 0.50 | 0.46 | 0.46 | 0.46 | |
| Housevotes | 79.79±0.94 | 75.50 | 75.83 | 75.50 | 75.50 | 80.68 | 79.96 | |
| 2:43516 | 46.91±1.87 | 42.09 | 46.38 | 42.09 | 42.09 | 48.86 | 48.74 | |
| Sonar | 50.16±0.28 | 51.62 | 52.66 | 52.22 | 51.62 | 51.62 | | |
| 2:20860 | 0.39±0.39 | 4.24 | 4.08 | 5.43 | 4.24 | 4.24 | 6.64 | |
| Spect | 60.68±4.79 | 66.73 | 68.06 | 68.06 | 66.73 | 66.73 | 68.98 | |
| 2:26744 | 3.38±3.72 | 0.17 | 2.35 | 2.35 | 0.17 | 0.17 | 10.96 | |
| Spectf | 63.87±0.94 | 50.16 | 50.16 | 50.16 | 50.16 | 62.03 | 70.76 | |
| 2:8844 | 21.84±1.52 | 3.88 | 3.88 | 3.88 | 3.88 | 20.54 | 34.36 | |
| AC-win | 1 | 0 | 1 | 2 | 0 | 0 | 3 | 12 |
| MI-win | 1 | 2 | 1 | 3 | 0 | 0 | 5 | 5 |
| Both-win | 1 | 0 | 1 | 1 | 0 | 0 | 2 | 5 |

Submitted in . This work is supported in part by National Natural Science Foundation of China (Nos. 11501310, 61866010, 11871183, and 61703370), in part by Natural Science Foundation of Hainan Province (No. 118QN181), and in part by Scientific Research Foundation of Hainan University (No. kyqd(sk)1804).
Zhen Wang is with School of Mathematical Sciences, Inner Mongolia University, Hohhot, 010021, P.R. China. E-mail: [email protected]
Yuan-Hai Shao (*Corresponding author) is with School of Economics and Management, Hainan University, Haikou, 570228, P.R. China. E-mail: [email protected]
Lan Bai is with School of Mathematical Sciences, Inner Mongolia University, Hohhot, 010021, P.R. China. E-mail: [email protected]
Chun-Na Li is with Zhijiang College, Zhejiang University of Technology, Hangzhou, 310024, P.R. China. E-mail: [email protected]
Li-Ming Liu is with School of Statistics, Capital University of Economics and Business, Beijing, 100070, P.R. China. E-mail: [email protected]
Abstract
In this paper, we propose a general model for plane-based clustering. The general model contains many existing plane-based clustering methods, e.g., k-plane clustering (kPC), proximal plane clustering (PPC), twin support vector clustering (TWSVC) and its extensions. Under this general model, one may obtain an appropriate clustering method for a specific purpose. The general model is a procedure corresponding to an optimization problem that minimizes the total loss of the samples, where the loss of a sample derives from both within-cluster and between-cluster information. In theory, the termination conditions are discussed, and we prove that the general model terminates in a finite number of steps at a local or weak local optimal point. Furthermore, based on this general model, we propose a plane-based clustering method by introducing a new loss function to capture the data distribution precisely. Experimental results on artificial and publicly available datasets verify the effectiveness of the proposed method.
Index Terms:
Unsupervised learning, plane-based clustering, general model, twin support vector clustering, loss function.
I Introduction
Clustering, which discovers the similarity among data samples, is one of the most important topics in unsupervised learning [1, 2, 3]. Many approaches assign the samples to clusters via certain cluster centers [4, 5, 6, 7, 8, 9]. Plane-based clustering treats the cluster center as a plane, and is thus able to find plane-shaped clusters. Moreover, plane-based clustering can easily be extended to nonlinear manifold modeling to cope with complex data structures. For these reasons, plane-based clustering has attracted much attention [8, 10, 11, 12, 13, 14, 15, 16].
The first plane-based clustering method, k-plane clustering (kPC) [8], was proposed by O.L. Mangasarian et al., and considered only the discriminative information from within-cluster. Subsequently, discriminative information from between-cluster was introduced into plane-based clustering. For instance, proximal plane clustering (PPC) [11] and twin support vector clustering (TWSVC) [12] require that the cluster center plane be not only as close as possible to the samples of the current cluster but also far away from the other clusters. Later, robust twin support vector clustering (RTWSVC) and fast robust twin support vector clustering (FRTWSVC) appeared [16]. More recently, ramp-based twin support vector clustering (RampTWSVC) [17] was proposed to deal with noise and outliers. It is therefore of interest to find a cluster center plane by considering the discriminative information from both within-cluster and between-cluster.
There is a close relationship between the clustering problem and the classification problem. In fact, the following correspondences hold: PPC corresponds to the generalized eigenvalue proximal support vector machines (GEPSVM) [18], TWSVC to the twin support vector machines (TWSVM) [19, 20], RTWSVC to the -TWSVM [21], FRTWSVC to the least square TWSVM [22], and RampTWSVC to the best fitting hyperplanes for classification (BFHC) [23]. This suggests relating plane-based clustering to supervised learning.
Briefly speaking, supervised learning is essentially based on two concepts, “loss function” and “regularization” [24, 25, 26, 27, 28, 21, 22, 29, 30]. We find that plane-based clustering can be established in a similar way, which yields our general model. It is built on a newly defined loss function derived from discriminative information [31] and on regularization. The general model iterates two steps: cluster update and cluster assignment. In the cluster update, new cluster center planes are obtained by minimizing the loss derived from the current cluster assignment. In the cluster assignment, each sample is assigned to the cluster with the least loss. The general model allows various loss functions and regularization terms to be selected, and most existing plane-based clustering methods can be regarded as particular cases with different selections. Furthermore, a new plane-based clustering method is derived from the general model. More precisely, following the model, we propose robust fitting distribution planes clustering (RFDPC) by employing a new loss function, the ramp loss [23] combined with certain statistics, which has a clear geometric meaning and captures the data distribution.
The main contributions of this paper include:
(i) A general model for plane-based clustering is proposed, in which different loss functions and regularization terms can be chosen, yielding in particular the existing kPC, PPC, TWSVC, RTWSVC, FRTWSVC, RampTWSVC, etc.
(ii) The cluster update and cluster assignment in the general model are consistent in minimizing the loss of samples, resulting in finite termination at a local or weak local optimal point.
(iii) A new loss function is introduced into the general model, yielding a method named RFDPC that copes with outliers and noise and captures the data distribution more precisely.
(iv) Experiments show that RFDPC compares favorably with the existing plane-based clustering methods.
The rest of this paper is organized as follows. The general model is elaborated in section II. Some plane-based clustering methods are summarised under the general model in section III. A novel plane-based clustering method (RFDPC) is described in section IV. Experiments and conclusions are presented in sections V and VI, respectively.
II The general model for plane-based clustering
II-A Formulation
Consider the clustering problem with $m$ data samples in the $n$-dimensional real vector space $\mathbb{R}^n$, represented by $X = (x_1, \ldots, x_m)$. Assume that these samples belong to $k$ clusters with corresponding labels in $\{1, \ldots, k\}$. Our task is to assign the samples to clusters, or equivalently to give their cluster labels
$$y = (y_1, \ldots, y_m)^\top, \quad y_i \in \{1, \ldots, k\}. \qquad (1)$$
For partition-based clustering [32, 7, 33, 34], the usual way is to find the cluster labels as well as the cluster centers. Plane-based clustering treats each cluster center as a plane. The cluster center planes are described as
$$w_j^\top x + b_j = 0, \quad j = 1, \ldots, k, \qquad (2)$$
where $w_j \in \mathbb{R}^n$ is the weight vector and $b_j \in \mathbb{R}$ is the bias term. Consider the deviation of a sample $x$ from the $j$-th cluster center plane ($j = 1, \ldots, k$). For instance, the deviation can be measured by the signed distance of $x$ to the plane as
$$f_j(x) = \frac{w_j^\top x + b_j}{\|w_j\|}, \qquad (3)$$
where $\|\cdot\|$ denotes the $L_2$ norm. A simpler alternative that reduces computation is
$$f_j(x) = w_j^\top x + b_j. \qquad (4)$$
Combining these deviation functions, either (3) or (4), yields the vector function
$$f(x) = (f_1(x), \ldots, f_k(x))^\top, \qquad (5)$$
where each $f_j$ is given by (3) or (4). Thus, the cluster center planes can be represented as
$$f_j(x) = 0, \quad j = 1, \ldots, k. \qquad (6)$$
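As a small illustration of the two deviation measures, the following Python sketch evaluates the signed distance and the unnormalized deviation for a sample against several planes. The function names and the toy planes are illustrative assumptions, not part of the paper.

```python
import math

def signed_distance(w, b, x):
    """Signed Euclidean distance from x to the plane w.x + b = 0 (deviation (3))."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def raw_deviation(w, b, x):
    """Cheaper, unnormalized deviation w.x + b (deviation (4))."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def deviation_vector(planes, x, deviation=signed_distance):
    """Vector function (5): one deviation per cluster center plane."""
    return [deviation(w, b, x) for (w, b) in planes]

# Hypothetical toy planes (w, b) in 2-D and one sample.
planes = [((3.0, 4.0), -5.0), ((0.0, 2.0), -2.0)]
x = (3.0, 4.0)
print(deviation_vector(planes, x))                 # signed distances
print(deviation_vector(planes, x, raw_deviation))  # unnormalized values
```

The two measures differ only by the factor $\|w_j\|$, so they coincide whenever the weight vector is constrained to unit norm.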
One popular approach [8, 11, 12] is to find the cluster labels $y$ and the cluster center planes (6) iteratively. Start with an initial assignment $y^{(0)}$. Next, for the given $y^{(0)}$, find the corresponding $f$ by establishing and solving an optimization problem. Then, update $y$ and the vector function $f$ alternately until certain termination conditions are satisfied.
The key point of this paper is to introduce the loss function into a general optimization problem. For the $i$th sample $x_i$ with assigned label $y_i$, the ideal case is to find $y_i$ and $f$ such that, on one hand, the sample lies exactly on the center plane $f_{y_i}(x) = 0$, and on the other hand, the sample is far away from the other center planes (in the extreme, $|f_j(x_i)| = \infty$, $j \neq y_i$). In the actual situation, the loss of sample $x_i$ should measure the deviation from this ideal case. Therefore, it should consist of two parts: (i) for its own center plane, the loss should depend on the deviation $f_{y_i}(x_i)$ and can be measured by a within-cluster function $L_w(f_{y_i}(x_i))$, where $L_w$ is a function from $\mathbb{R}$ to $\mathbb{R}$ with the condition $L_w(0) = 0$; (ii) for the other center planes, the loss can be measured by a between-cluster function $L_b(f_j(x_i))$ with $j \neq y_i$, where $L_b$ is a function from $\mathbb{R}$ to $\mathbb{R}$. Thus, for the sample $x_i$ ($i = 1, \ldots, m$), the loss is described by
$$L(y_i, f(x_i)) = c_1 L_w(f_{y_i}(x_i)) + c_2 \sum_{j \neq y_i} L_b(f_j(x_i)), \qquad (7)$$
where $c_1$ and $c_2$ are positive parameters. Furthermore, for the dataset $X$, the total loss is
$$\sum_{i=1}^{m} L(y_i, f(x_i)). \qquad (8)$$
This leads to the optimization problem for both $y$ and $f$ ($y_i \in \{1, \ldots, k\}$, $i = 1, \ldots, m$) as
$$\min_{y, f} \ F(y, f) = \sum_{i=1}^{m} L(y_i, f(x_i)) + R(f), \qquad (9)$$
where $R(f)$ denotes the regularization term in the functional space of $f$.
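The loss of one sample and the total loss can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: deviations are precomputed per sample, labels are 0-based (the paper numbers clusters from 1), the weights are called `c1` and `c2`, and the particular within-cluster and between-cluster functions are illustrative choices rather than the paper's.

```python
def sample_loss(y_i, devs, L_w, L_b, c1=1.0, c2=1.0):
    """Loss of one sample: within-cluster part plus between-cluster parts.

    devs[j] is the deviation of the sample from the j-th center plane.
    """
    within = L_w(devs[y_i])
    between = sum(L_b(d) for j, d in enumerate(devs) if j != y_i)
    return c1 * within + c2 * between

def total_loss(labels, all_devs, L_w, L_b, c1=1.0, c2=1.0):
    """Total loss of the dataset (regularization not included)."""
    return sum(sample_loss(y, d, L_w, L_b, c1, c2)
               for y, d in zip(labels, all_devs))

# Illustrative choices (assumptions): squared within-cluster deviation and a
# hinge-type between-cluster penalty that vanishes once |t| >= 1.
L_w = lambda t: t * t
L_b = lambda t: max(0.0, 1.0 - abs(t))

all_devs = [[0.1, 2.0], [1.5, -0.2], [0.0, 0.5]]  # three samples, two planes
labels = [0, 1, 0]
print(total_loss(labels, all_devs, L_w, L_b))
```

The third sample lies exactly on its own plane but is only 0.5 away from the other one, so its loss comes entirely from the between-cluster term.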
Problem (9) is very similar to the optimization problem in supervised learning, in that it consists of a loss and a regularization term, but the two settings have different aims. Supervised learning aims at predicting unknown samples by minimizing the loss of the training samples, whereas clustering focuses on minimizing the loss of the given samples themselves, and the regularization makes this efficient. Based on problem (9), the general model for plane-based clustering is constructed in Model 1.
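Model 1 The general model
Input: Dataset $X$, the within-cluster function $L_w$, the between-cluster function $L_b$, and the parameters $c_1$, $c_2$.
Output: $y$ and $f$.
1. Initialize the sample labels $y^{(0)}$.
2. For $t = 0, 1, \ldots$, compute $f^{(t+1)}$ and $y^{(t+1)}$ by the following steps:
(a) Cluster update: For the current $y^{(t)}$, $f^{(t+1)}$ is set to be the solution to the optimization problem
$$\min_{f} \ F(y^{(t)}, f), \qquad (10)$$
where $F$ is given by (9), or equivalently the solutions to $k$ subproblems with $j = 1, \ldots, k$ as follows:
[TABLE]
(b) Cluster assignment: For the current $f^{(t+1)}$, the labels $y^{(t+1)}$ are set to be the solution of the following optimization problem
$$\min_{y} \ F(y, f^{(t+1)}), \qquad (12)$$
or equivalently, the labels are given by
$$y_i^{(t+1)} = \arg\min_{j \in \{1, \ldots, k\}} \ c_1 L_w(f_j^{(t+1)}(x_i)) + c_2 \sum_{l \neq j} L_b(f_l^{(t+1)}(x_i)), \qquad (13)$$
with $i = 1, \ldots, m$. If there is a tie, the cluster with the smallest label number is selected.
(c) Repetitiveness check: If $f^{(t+1)}$ is a solution to problem (10) where $y^{(t)}$ is replaced by $y^{(t+1)}$, break the loop and go to step 3.
(d) If the termination condition is satisfied, go to step 3; otherwise, set $t = t + 1$ and go back to step 2(a).
3. Set $y = y^{(t+1)}$ and $f = f^{(t+1)}$.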
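The alternation in Model 1 can be sketched in Python for the simplest case. This is a sketch under stated assumptions, not the authors' implementation: it works in 2-D only, uses a kPC-style cluster update (squared within-cluster deviation, no between-cluster term, unit-norm weight vector) solved in closed form through the 2x2 scatter matrix, assigns each sample to the plane with the smallest absolute deviation, and stops on a repeated assignment. All function names and the toy data are hypothetical.

```python
import math

def fit_plane_2d(points):
    """kPC-style update in 2-D: the unit normal w is the eigenvector of the
    smallest eigenvalue of the scatter matrix, and b = -w . mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points)
    d = sum((p[1] - my) ** 2 for p in points)
    c = sum((p[0] - mx) * (p[1] - my) for p in points)
    lam = (a + d) / 2 - math.sqrt(((a - d) / 2) ** 2 + c * c)
    # Two algebraically equivalent eigenvector formulas; keep the better-conditioned one.
    w = max([(c, lam - a), (lam - d, c)], key=lambda v: v[0] ** 2 + v[1] ** 2)
    norm = math.hypot(w[0], w[1])
    if norm < 1e-12:              # isotropic scatter: any direction is optimal
        w, norm = (1.0, 0.0), 1.0
    w = (w[0] / norm, w[1] / norm)
    return w, -(w[0] * mx + w[1] * my)

def deviation(plane, p):
    (w, b) = plane
    return w[0] * p[0] + w[1] * p[1] + b

def general_model(points, labels, k, max_iter=100):
    planes = [((1.0, 0.0), 0.0)] * k
    seen = {tuple(labels)}
    history = []                  # objective value after each assignment step
    for _ in range(max_iter):
        # (a) cluster update, one subproblem per cluster
        for j in range(k):
            members = [p for p, y in zip(points, labels) if y == j]
            if members:           # keep the old plane if a cluster is empty
                planes[j] = fit_plane_2d(members)
        # (b) cluster assignment; min() keeps the smallest index on ties
        labels = [min(range(k), key=lambda j: abs(deviation(planes[j], p)))
                  for p in points]
        history.append(sum(deviation(planes[y], p) ** 2
                           for p, y in zip(points, labels)))
        # stop when an overall assignment repeats
        if tuple(labels) in seen:
            break
        seen.add(tuple(labels))
    return labels, planes, history

# Two parallel lines, y = 0 and y = 3, with one sample initially mislabeled.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (0.0, 3.0), (1.0, 3.0), (2.0, 3.0), (3.0, 3.0)]
init = [0, 0, 0, 0, 0, 1, 1, 1]
labels, planes, history = general_model(points, init, k=2)
print(labels, history)
```

Because each step solves its subproblem exactly, the recorded objective values never increase, which is the mechanism behind the finite-termination results discussed later.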
In step 1 of Model 1, a common way is to assign the samples to clusters randomly, which results in unstable clustering performance. It is preferable to choose a stable initialization technique, e.g., the nearest neighbor graph (NNG) [12], which has been successfully applied to several plane-based clustering methods [12, 16, 17]. In step 2(a), if several global solutions can be obtained, the same one is selected for the same $y^{(t)}$. However, we may only get a local solution. Note that $f^{(t)}$ is already available (for $t > 0$) before solving problem (10). If a local solution is inevitable, the local solution in step 2(a) must be no worse than the previous solution $f^{(t)}$. Thus, it is a good choice to use the $t$-th local solution as the initial point of the $(t+1)$-th problem in step 2(a), if the assumption is false in step 2(c). In other words, the inequality $F(y^{(t)}, f^{(t+1)}) \le F(y^{(t)}, f^{(t)})$ always holds during the iteration.
Besides, the functions $L_w$ and $L_b$ should also be pre-defined. It is reasonable to select them with the following properties.
(i) $L_w(-t) = L_w(t)$ and $L_b(-t) = L_b(t)$.
(ii) $L_w(t)$ is monotonically non-decreasing in $[0, +\infty)$.
(iii) $L_b(t)$ is monotonically non-increasing in $[0, +\infty)$.
In this case, we have the following theorem.
**Theorem II.1.**
If the within-cluster function $L_w$ and the between-cluster function $L_b$ satisfy the above three properties (i)-(iii), then the sample assignment (13) can be simplified as
$$y_i^{(t+1)} = \arg\min_{j \in \{1, \ldots, k\}} |f_j^{(t+1)}(x_i)|, \qquad (14)$$
where $|\cdot|$ denotes the absolute value.
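Proof.
Suppose that, for an arbitrary sample $x$, $y'$ is the label of $x$ obtained by (14), i.e., $|f_{y'}(x)|$ is the smallest one in $\{|f_j(x)| : j = 1, \ldots, k\}$, and suppose $\tilde{y}$ is an arbitrary label of $x$. The objective values of (13) at $y'$ and $\tilde{y}$ are
$$c_1 L_w(f_{y'}(x)) + c_2 \sum_{j \neq y'} L_b(f_j(x)) \qquad (15)$$
and
$$c_1 L_w(f_{\tilde{y}}(x)) + c_2 \sum_{j \neq \tilde{y}} L_b(f_j(x)), \qquad (16)$$
respectively.
From the properties (i) and (ii), we have $L_w(f_{y'}(x)) \le L_w(f_{\tilde{y}}(x))$. Similarly, we have $L_b(f_{\tilde{y}}(x)) \le L_b(f_{y'}(x))$ from the properties (i) and (iii), and hence $\sum_{j \neq y'} L_b(f_j(x)) \le \sum_{j \neq \tilde{y}} L_b(f_j(x))$, since the two sums differ only in these two terms. Thus, (15) is not larger than (16) because $c_1$ and $c_2$ are positive. This implies that $y'$ corresponds to the smallest objective value of (13). ∎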
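The statement just proved can be checked numerically. The sketch below compares the full assignment rule, which evaluates the loss for every candidate label, with the simplified rule that picks the smallest absolute deviation. The loss functions, the weights `c1` and `c2`, and the random deviations are illustrative assumptions; the check compares objective values rather than labels so that exact ties do not matter.

```python
import random

def label_loss(j, devs, L_w, L_b, c1, c2):
    """Objective value of the full assignment rule when label j is chosen."""
    return c1 * L_w(devs[j]) + c2 * sum(L_b(d) for l, d in enumerate(devs) if l != j)

def full_assignment(devs, L_w, L_b, c1, c2):
    """Try every label and keep the one with the least loss."""
    return min(range(len(devs)), key=lambda j: label_loss(j, devs, L_w, L_b, c1, c2))

def simplified_assignment(devs):
    """Pick the label with the smallest absolute deviation."""
    return min(range(len(devs)), key=lambda j: abs(devs[j]))

def agree_on_random_data(L_w, L_b, c1=1.0, c2=0.5, k=4, trials=1000, seed=0):
    """True if the simplified rule always attains the minimal loss."""
    rng = random.Random(seed)
    for _ in range(trials):
        devs = [rng.uniform(-3.0, 3.0) for _ in range(k)]
        best = label_loss(full_assignment(devs, L_w, L_b, c1, c2), devs, L_w, L_b, c1, c2)
        simple = label_loss(simplified_assignment(devs), devs, L_w, L_b, c1, c2)
        if simple > best + 1e-12:
            return False
    return True

# Even functions: squared deviation (non-decreasing in |t|) and a hinge-type
# penalty (non-increasing in |t|).
print(agree_on_random_data(lambda t: t * t, lambda t: max(0.0, 1.0 - abs(t))))
```

Swapping in a between-cluster function that grows with $|t|$ breaks the monotonicity requirement, and the two rules then generally disagree.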
Now, we extend the general model to the nonlinear case via a kernel trick [4, 35, 12, 36]. For nonlinear manifold clustering, the cluster centers are defined as
$$w_j^\top \varphi(x) + b_j = 0, \quad j = 1, \ldots, k, \qquad (17)$$
where $\varphi(\cdot)$ is a pre-defined nonlinear mapping. Thus, the deviation of a sample from a cluster center depends strictly on the nonlinear mapping. Generally, it is not necessary to give the explicit nonlinear mapping $\varphi(\cdot)$, because the general model only uses the deviation. There are many kernel tricks to estimate the deviation. For instance, the deviation can be estimated by $K(x, X)^\top w_j + b_j$, where $K(\cdot, \cdot)$ is a predetermined kernel function [35] and $K(x_1, x_2) = \langle \varphi(x_1), \varphi(x_2) \rangle$ ($\langle \cdot, \cdot \rangle$ denotes the inner product). By selecting an appropriate kernel function, the nonlinear general model can be obtained without any difficulty, so the details are omitted.
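A kernel-based deviation of this kind can be sketched as follows. The Gaussian (RBF) kernel, the parameter `gamma`, and the one-weight-per-sample parameterization are assumptions made for illustration; the paper does not fix a particular kernel here.

```python
import math

def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x1 - x2||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def kernel_deviation(x, samples, w, b, kernel=rbf_kernel):
    """Estimated deviation: weighted sum of kernel values against the samples, plus bias."""
    return sum(wi * kernel(x, xi) for wi, xi in zip(w, samples)) + b

# Hypothetical example: two reference samples with opposite weights.
samples = [(0.0, 0.0), (1.0, 0.0)]
w, b = [1.0, -1.0], 0.0
print(kernel_deviation((0.5, 0.0), samples, w, b))  # equidistant point
print(kernel_deviation((0.0, 0.0), samples, w, b))  # closer to the first sample
```

With these weights, the zero set of the deviation is the set of points equidistant from the two samples, which plays the role of the cluster center in the mapped space.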
II-B Analysis
In this subsection, the termination conditions of the above general model are analysed. More exactly, it is concerned with the following three termination conditions.
(i) A repeated overall assignment of samples to clusters occurs, i.e., $y^{(t_1)} = y^{(t_2)}$ where $t_1 < t_2$ [8, 11].
(ii) A non-decrease in the objective function occurs [8, 11].
(iii) Both (i) and (ii) occur.
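The three conditions can be expressed as simple checks on the iteration history. This is a sketch with hypothetical names; it assumes the label vectors are stored as tuples and the objective values as floats, one entry per iteration.

```python
def repeated_assignment(label_history):
    """Condition (i): the newest assignment equals an earlier one."""
    return label_history[-1] in label_history[:-1]

def no_decrease(objective_history, tol=0.0):
    """Condition (ii): the objective did not decrease in the last step."""
    return (len(objective_history) >= 2
            and objective_history[-1] >= objective_history[-2] - tol)

def should_stop(label_history, objective_history, condition="iii"):
    """Evaluate termination condition 'i', 'ii' or 'iii' (both together)."""
    i = repeated_assignment(label_history)
    ii = no_decrease(objective_history)
    return {"i": i, "ii": ii, "iii": i and ii}[condition]

label_history = [(0, 0, 1), (0, 1, 1), (0, 1, 1)]
objective_history = [3.2, 1.5, 1.5]
print(should_stop(label_history, objective_history, "iii"))
```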
Corresponding to the different meaning of the solution in step 2(a), we have the following two theorems.
**Theorem II.2.**
Under either termination condition (i) or (ii), the general model terminates in a finite number of steps if the solution in step 2(a) means global solution.
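Proof.
The iterations in the general model can be summarized as
$$y^{(0)} \to f^{(1)} \to y^{(1)} \to f^{(2)} \to \cdots \to y^{(t)} \to f^{(t+1)} \to \cdots \qquad (18)$$
Since there are a finite number of ways that the samples can be assigned to $k$ clusters, there are two integers $t_1 < t_2$ such that $y^{(t_1)} = y^{(t_2)}$. Therefore, the general model terminates in a finite number of steps under termination condition (i).
Moreover, the corresponding $f^{(t_1+1)}$ and $f^{(t_2+1)}$ are the global solutions to the same optimization problem (10). Thus, we have $F(y^{(t_1)}, f^{(t_1+1)}) = F(y^{(t_2)}, f^{(t_2+1)})$. Note that the global solution in step 2(a) guarantees that the objective is non-increasing in iteration. Then we have
$$F(y^{(t_1)}, f^{(t_1+1)}) = F(y^{(t_1+1)}, f^{(t_1+1)}) = \cdots = F(y^{(t_2)}, f^{(t_2+1)}). \qquad (19)$$
Therefore, the general model terminates in a finite number of steps under termination condition (ii). ∎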
**Theorem II.3.**
Suppose the number of the local solutions or the local optimal values to the problem is finite. Under termination condition (iii), the general model terminates in a finite number of steps if the solution in step 2(a) means local solution.
Proof.
Consider sequence (18). Since there are a finite number of ways that the samples can be assigned to clusters, we can find a subsequence of from (18) in which the elements are the same. Based on the assumptions, there are two integers such that , where belongs to the above subsequence of . Since the objective is non-increasing in the iteration, it is invariable from the step to . Therefore, the general model terminates before or at step under termination condition (iii). ∎
Generally speaking, the general model may also terminate in a finite number of steps with other termination conditions. However, the termination point obtained by the general model would be very different under different termination conditions. In fact, when the general model terminates, there should not be any other available points which make the objective function decrease. To study the convergence of the general model further, we introduce two definitions.
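**Definition II.1.**
(Local optimal point, by O.L. Mangasarian in [8]) Point $(y^*, f^*)$ is defined as the local optimal point to the function $F(y, f)$ if $y^*$ is the global solution to the problem $\min_y F(y, f^*)$, and meanwhile $f^*$ is the global solution to the problem $\min_f F(y^*, f)$.
**Definition II.2.**
(Weak local optimal point) Point $(y^*, f^*)$ is defined as the weak local optimal point to the function $F(y, f)$ if $y^*$ is the global solution to the problem $\min_y F(y, f^*)$, and meanwhile $f^*$ is a local solution to the problem $\min_f F(y^*, f)$.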
Now, we have the following two theorems.
**Theorem II.4.**
The general model with termination condition (i) or (ii) terminates at a local optimal point if the solution in step 2(a) means global solution.
Proof.
From the proof of Theorem 2.2, there is a finite number such that equations (19) hold. Thus, the point is a local optimal point and . Then, the general model terminates at step under termination condition (i) or (ii). ∎
**Theorem II.5.**
Suppose the number of the local solutions or the local optimal values to the problem is finite. The general model with termination condition (iii) terminates at a weak local optimal point if the solution in step 2(a) means local solution.
Proof.
From the proof of Theorem 2.3, there are two finite integers such that . Due to the non-increase of the objective in the iteration, equations (19) hold and . Note that is the global solution to the problem , and also attains the same optimal value. It shows that is also the global solution to the above problem. Thus, holds because of the uniqueness of the assigned labels guaranteed in step 2(b), and then is a weak local optimal point. Therefore, the conclusion holds by Theorem 2.3. ∎
III Reorganization of the plane-based clustering methods
In this section, we show that the general model yields current plane-based clustering methods by selecting different deviation formations and loss functions.
III-A kPC
kPC [8] is the first plane-based clustering method. It starts with a random assignment of the samples. Then, for the $j$th cluster ($j = 1, \ldots, k$), its cluster center (2) is required to fit the samples of that cluster by solving the following problem
[TABLE]
When the cluster centers are obtained, the samples are reassigned to clusters by
[TABLE]
The cluster centers and the samples’ labels are updated alternately until termination condition (i) or (ii) is satisfied.
To organize kPC by the general model, we select and hire the within-cluster function and between-cluster function with (see Fig. 1(i)). Thus, the loss of the sample () is
[TABLE]
Without any difficulty, we can use the general model to generate a plane-based clustering method by using the loss function (22). By setting and , problem (11) solved in the general model is equivalent to problem (20) in kPC. Since and satisfy the conditions of Theorem 2.1, it is easy to conclude that kPC is consistent with the general model by the loss function (22). The global solution to problem (20) can be obtained by solving an eigenvalue problem, and we immediately conclude that kPC finitely terminates at a local optimal point by Theorem 2.4 (this finite termination has been proven by Mangasarian, see Theorem 7 in [8]).
It is worth noting that kPC only considers the discriminative information from within-cluster. PPC, described next, was proposed by introducing the discriminative information from between-cluster.
III-B PPC
The procedure of PPC [11, 10] is similar to that of kPC; the only difference is the stage of reconstructing the cluster centers. PPC requires that the samples from the current cluster be close to its cluster center, and meanwhile that the samples from different clusters be far away from it. The $j$th ($j = 1, \ldots, k$) cluster center plane is obtained by solving the following problem
[TABLE]
where is a positive parameter.
Similarly, to organize PPC by the general model, we select and hire the functions and with (see Fig. 1(ii)). Thus, the loss of the sample () is
[TABLE]
Obviously, $L_w$ and $L_b$ satisfy the conditions of Theorem 2.1. Therefore, PPC can be regarded as an instance of the general model with the loss function (24). Since the global solution to problem (23) can be obtained by solving an eigenvalue problem, we can immediately conclude by Theorem 2.4 that PPC terminates finitely at a local optimal point, a result that was not provided previously. From the loss function (24), it can be seen that PPC uses the $L_2$ norm to measure the discriminative information from between-cluster, which may be sensitive to noise or outliers.
III-C TWSVC
To reduce the influence of noise and outliers, TWSVC [12] requires the samples from different clusters to be at least a certain distance away from the cluster center. The $j$th ($j = 1, \ldots, k$) cluster center is obtained from the following problem
[TABLE]
where is a slack variable.
By selecting and hiring the functions and with , (see Fig. 1(iii)), the loss of the sample () is
[TABLE]
where $(\cdot)_+$ replaces a negative value with zero. Obviously, $L_w$ and $L_b$ satisfy the conditions of Theorem 2.1. Therefore, TWSVC can be regarded as our general model with the loss function (26), except for a slight difference in the solution to problem (25), which is obtained independently. It is worth mentioning that if TWSVC is implemented strictly by the general model, it terminates in a finite number of steps at a weak local optimal point by Theorem 2.5.
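To make the comparison concrete, the within-cluster and between-cluster functions of kPC, PPC and TWSVC can be written as plain Python functions and checked on a grid. This sketch rests on assumptions: the forms below follow the verbal descriptions in this section (squared deviation within the cluster; no between-cluster term for kPC, a negative squared term for PPC, a hinge-type term for TWSVC) and ignore constant factors, and the three properties are read as evenness, non-decreasing $L_w$ and non-increasing $L_b$ in the absolute deviation.

```python
# (L_w, L_b) pairs; exact constants in the paper may differ (assumption).
LOSSES = {
    "kPC":   (lambda t: t * t, lambda t: 0.0),
    "PPC":   (lambda t: t * t, lambda t: -t * t),
    "TWSVC": (lambda t: t * t, lambda t: max(0.0, 1.0 - abs(t))),
}

def satisfies_properties(L_w, L_b, grid=None, tol=1e-12):
    """Check evenness and monotonicity in |t| on a finite grid of t >= 0."""
    grid = grid or [i / 10 for i in range(0, 51)]   # t = 0.0, 0.1, ..., 5.0
    even = all(abs(L_w(t) - L_w(-t)) <= tol and abs(L_b(t) - L_b(-t)) <= tol
               for t in grid)
    w_up = all(L_w(a) <= L_w(b) + tol for a, b in zip(grid, grid[1:]))
    b_down = all(L_b(a) >= L_b(b) - tol for a, b in zip(grid, grid[1:]))
    return even and w_up and b_down

for name, (L_w, L_b) in LOSSES.items():
    print(name, satisfies_properties(L_w, L_b))
```

A grid check is only a sanity test, not a proof, but it quickly flags a candidate loss for which the simplified assignment rule would not be justified.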
III-D Extensions on TWSVC
Owing to its stable performance, TWSVC has several extensions. For instance, RTWSVC [16] replaces the $L_2$ norm with the $L_1$ norm in the within-cluster function, which further decreases the influence of noise and outliers. Another extension, FRTWSVC [16], uses a least squares formulation to accelerate learning. The third extension, RampTWSVC [17], introduces the ramp loss function into TWSVC to further decrease the influence of noise and outliers from both within-cluster and between-cluster. They construct the cluster centers by different optimization problems. By selecting , we summarize their within-cluster, between-cluster and loss functions (see Fig. 1(iv)-(vi)) as follows.
[TABLE]
where , are the user defined constants.
By substituting these loss functions (27) into our general model, it is easy to get the optimization problems of RTWSVC, FRTWSVC and RampTWSVC. In theory, RTWSVC, FRTWSVC and RampTWSVC would terminate in a finite number of steps at the weak local optimal points if they are implemented by the general model strictly. The details are omitted.
IV RFDPC
In this section, we introduce a new loss function that varies with the dataset, and then propose our robust fitting distribution planes for clustering (RFDPC) based on the general model.
Let us start from the efficient RampTWSVC [17]. Its ability to reduce the influence of noise and outliers is shown in Fig. 1. However, for samples drawn from the same distribution, RampTWSVC may obtain very different cluster centers, which biases the result away from the data distribution. For instance, in Fig. 2, there are two groups of samples from (i.e., left and right three columns). The two centers obtained by RampTWSVC, depicted by solid blue lines in Fig. 2(b), are very different from each other.
To capture the data distribution, we introduce the first-order and second-order statistics [37] of the cluster into the within-cluster function and propose a new within-cluster function as
[TABLE]
where are positive parameters. and , where is the index set of the th cluster that belongs to and denotes the sample number of this cluster. In other words, and are the corresponding parts in the mean and variance of the th cluster with . The additional statistics in (28) mean that a sample assigned to a cluster would lead additional losses: (i) loss derived from the mean deviation, i.e., the deviation of the sample from the statistical center; (ii) loss derived from the variance of deviation, i.e., the deviation proportionality. Minimizing these statistics would make the cluster center close to the highest density region and the samples be uniformly distributed along with the cluster center. Fig. 2(c) shows the result by new function (28).
Then, by setting the between-cluster function , the loss function of RFDPC becomes
[TABLE]
where .
By introducing a regularization term, the subproblem in step 2(a) is considered as
[TABLE]
and its local solution can be obtained by the concave-convex procedure (CCCP) [38].
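As a reminder of how CCCP proceeds, the sketch below applies it to a toy one-dimensional difference-of-convex objective, f(x) = x^4 - 2x^2, rather than to the RFDPC subproblem, whose exact form is not reproduced here. The convex part is u(x) = x^4 and the subtracted convex part is v(x) = 2x^2; each iteration linearizes v at the current point and minimizes the resulting convex surrogate, which has a closed form in this example. The function names are hypothetical.

```python
import math


def objective(x):
    """Toy difference-of-convex objective f(x) = u(x) - v(x)."""
    return x ** 4 - 2.0 * x ** 2


def cccp(x0, tol=1e-10, max_iter=1000):
    """Concave-convex procedure for f(x) = x^4 - 2x^2.

    Each step solves min_x x^4 - v'(x_t) * x with v'(x_t) = 4 * x_t,
    whose stationarity condition 4x^3 = 4x_t gives x = cbrt(x_t).
    """
    x = x0
    for _ in range(max_iter):
        slope = 4.0 * x  # gradient of v at the current iterate
        # Real cube root of slope / 4, keeping the sign of the slope.
        x_new = math.copysign(abs(slope / 4.0) ** (1.0 / 3.0), slope)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


x_pos = cccp(0.5)   # approaches the local minimizer x = 1
x_neg = cccp(-0.5)  # approaches the local minimizer x = -1
```

The iterates decrease the objective monotonically and the limit depends on the starting point, which illustrates why only a local solution is claimed.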
It should be pointed out that the cluster assignment (13) can be replaced by the simplified assignment (14), though the function does not satisfy properties (i)-(iii).
**IV.1.**
In RFDPC, the sample assignment (13) can be simplified as (14).
Proof.
Suppose is the label of an arbitrary sample obtained by (14), and is an arbitrary label of . From the proof of Theorem 2.1, we just need to prove and .
Note that and are positive parameters. Since smaller leads to smaller and smaller , and since is non-decreasing in , the inequality holds. Noticing that satisfies properties (i)-(iii), the inequality holds. Therefore, the conclusion is obtained. ∎
In addition, our RFDPC adopts termination condition (iii), and thus it terminates in a finite number of steps at a weak local optimal point by Theorem 2.5.
V Experimental results
In this section, we analyze the performance of our RFDPC compared with some state-of-the-art partition-based clustering methods on several artificial and benchmark datasets. All the methods were implemented in MATLAB 2017 on a PC with an Intel Core Duo Processor (double 4.2 GHz) and 16GB RAM. In the experiments, we used the metrics accuracy (AC) [12] and mutual information (MI) [39] to measure the performance of these methods.
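For readers unfamiliar with the MI metric, the following minimal sketch computes the mutual information between a ground-truth labeling and a predicted clustering from their joint label counts. It is an assumption-laden illustration: the result is in nats and unnormalized, whereas the percentage values reported in the tables may rely on a normalization defined in [39] that is not reproduced here. The AC metric of [12] is not sketched. The function name `mutual_information` is hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(labels_true, labels_pred):
    """Mutual information (in nats) between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))  # joint label counts
    count_true = Counter(labels_true)               # marginal counts
    count_pred = Counter(labels_pred)
    mi = 0.0
    for (t, p), c in joint.items():
        # p(t, p) * log( p(t, p) / (p(t) * p(p)) ), written with counts
        mi += (c / n) * log(c * n / (count_true[t] * count_pred[p]))
    return mi


truth = [0, 0, 1, 1]
# Same partition with swapped cluster names: MI is invariant to renaming.
mi_perfect = mutual_information(truth, [1, 1, 0, 0])
# Clustering that is independent of the ground truth.
mi_independent = mutual_information(truth, [0, 1, 0, 1])
```

Because the measure depends only on the joint counts, it does not require matching cluster indices to class indices.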
On the synthetic data, we tested the ability of the plane-based clustering methods to capture the plane-based data distribution. The synthetic data in consists of three classes, where one class is on a plane and the other two classes are on two lines, respectively. The details of the synthetic data are shown in Table I. We sampled four groups from the synthetic data, which include 120, 100, 80 and 60 samples, respectively. Then, the plane-based clustering methods, including kPC [8], PPC [11, 10], TWSVC [12], RTWSVC [16], FRTWSVC [16], RampTWSVC [17] and our RFDPC, were implemented on these four groups, where the parameters , and were set to , was set to , and was set to . The clustering results are depicted in Fig. 3. It can be seen from Fig. 3 that (i) kPC and TWSVC cannot capture these plane-based clusters; (ii) PPC frequently obtains a plane constructed by the two lines; (iii) RTWSVC and FRTWSVC capture the three clusters on group G1, but both of them lose a cluster when the number of samples decreases; (iv) RampTWSVC always finds three clusters, but inaccurately; (v) our RFDPC finds the three clusters exactly. Thus, our RFDPC captures the plane-based clusters more precisely than the other methods on the synthetic datasets.
To exhibit the relationship between a sample and its cluster center, the deviation statistics of the samples from their cluster center planes are depicted in Fig. 4, where ‘-’ denotes the 1-order statistics of each cluster and ‘’ denotes the 2-order statistics of each cluster. A cluster that only has a ‘’ in Fig. 4 means its 1-order and 2-order statistics are outside the figure window. It is clear that the 2-order deviation statistics of kPC, PPC, and TWSVC are far from their 1-order statistics, and hence these methods cannot find the three plane-based clusters exactly. With RTWSVC and FRTWSVC, the cluster samples lie on their cluster center planes on groups G1, G2 and G3, but these methods fail to find the 3rd cluster on groups G2 and G3. RampTWSVC fluctuates greatly, and thus it cannot capture the three plane-based clusters exactly. In contrast, our RFDPC captures the plane-based clusters well by adding the additional statistics. The quantitative measurements are reported in Table II, with the highest values in bold. Our RFDPC achieves higher performance on all four groups than the other plane-based clustering methods, which is consistent with the previous observations.
In the following experiment, we implemented the above methods and kmeans [32] on several benchmark datasets [40] for linear and nonlinear cases. Typically, was set to , was set to , and . Other parameters in these methods were selected from . For the nonlinear case, the Gaussian kernel [41, 10, 42] was used and its parameter was selected from . Random initialization was used for kmeans, and the NNG initialization [12] was used for the remaining plane-based clustering methods to obtain stable performance. We report the AC and MI of these methods in Tables III and IV for the linear and nonlinear cases, respectively. Among them, kmeans was run ten times, and the mean value and standard deviation were computed and reported. The highest ACs or MIs are in bold, and the numbers of datasets with the highest AC, MI and both are also shown in these tables. From Table III, it can be seen that our RFDPC outperforms the other methods on most of the datasets. Our RFDPC has the highest AC on 18/23 datasets, the highest MI on 12/23 datasets, and both of them on 12/23 datasets. Moreover, our RFDPC is comparable with the methods that have the highest AC or MI on most of the remaining datasets. Table IV shows results similar to those of Table III and confirms the observation from Table III. To exhibit the cluster center planes obtained by these plane-based clustering methods, we depict the deviation statistics on the datasets “Haberman”, “Iris”, “Pathbased” and “Vehicle” in Fig. 5 as instances. Our RFDPC has small and tight 2-order deviation statistics around the 1-order statistics, which improves the performance of plane-based clustering significantly.
VI Conclusions
A general model for plane-based clustering has been proposed by introducing a loss function and regularization. It has been shown theoretically that the general model terminates in a finite number of steps at local or weak local optimal points. The existing plane-based clustering methods, including kPC, PPC, TWSVC, RTWSVC, FRTWSVC and RampTWSVC, are consistent with this general model. Furthermore, a new plane-based clustering method (RFDPC) based on the general model has been proposed. Experimental results on the synthetic and publicly available datasets have indicated that our RFDPC can capture the data distribution more precisely. For practical convenience, the corresponding RFDPC Matlab code has been uploaded to http://www.optimal-group.org/Resources/Code/RFDPC.html. In future work, it would be interesting to find more efficient loss functions and generalization terms in the general model to suit specific clustering purposes.
- [1] A. Jain, M. Murty, and P. Flynn, “Data clustering: a review,” ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999.
- [2] M. Aldenderfer and R. Blashfield, Cluster Analysis. Los Angeles: Sage Publications, 1985.
- [3] X. Pei, C. Chen, and W. Gong, “Concept factorization with adaptive neighbors for document clustering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 2, pp. 343–352, 2018.
- [4] I. Dhillon, Y. Guan, and B. Kulis, “Kernel k-means: spectral clustering and normalized cuts,” The tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 551–556, 2004.
- [5] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
- [6] X. Huang, Y. Ye, and H. Zhang, “Extensions of kmeans-type algorithms: a new clustering framework by integrating intracluster compactness and intercluster separation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 8, pp. 1433–1446, 2014.
- [7] P. Bradley, O. Mangasarian, and W. Street, “Clustering via concave minimization,” Advances in Neural Information Processing Systems, vol. 9, pp. 368–374, 1997.
- [8] P. Bradley and O. Mangasarian, “k-plane clustering,” Journal of Global Optimization, vol. 16, no. 1, pp. 23–32, 2000.
