Fast Fusion Clustering via Double Random Projection

Hongni Wang; Na Li; Yanqiu Zhou; Jingxin Yan; Bei Jiang; Linglong Kong; Xiaodong Yan

PMC · DOI:10.3390/e26050376·April 28, 2024

Open Access

TL;DR

This paper introduces a faster and more accurate clustering method using random projections to improve computational efficiency and results.

Contribution

The novel double random projection ADMM algorithm improves fusion clustering speed and accuracy in high-dimensional data.

Findings

1. The new algorithm significantly increases computational speed by reducing complexity.

2. Multiple random projections improve clustering accuracy under a new evaluation criterion.

3. The algorithm's convergence is proven and validated on simulated and real data.

Abstract

In unsupervised learning, clustering is a common starting point for data processing. The convex or concave fusion clustering method is a novel approach that is more stable and accurate than traditional methods such as k-means and hierarchical clustering. However, the optimization algorithm used with this method can be slowed down significantly by the complexity of the fusion penalty, which increases the computational burden. This paper introduces a random projection ADMM algorithm based on the Bernoulli distribution and develops a double random projection ADMM method for high-dimensional fusion clustering. These new approaches significantly outperform the classical ADMM algorithm: they increase computational speed by reducing complexity, and they improve clustering accuracy by using multiple random projections under a new evaluation criterion. We also…

Funding

- National Key R&D Program of China
- the National Natural Science Foundation of China
- the National Statistical Science Research Project
- Jinan Science and Technology Bureau
- the China Academy of Engineering Science and Technology Development Strategy Shandong Research Institute Consulting Research Project
- the State Scholarship Fund from China Scholarship Council
- the Alberta Machine Intelligence Institute (AMII)
- Natural Sciences and Engineering Council of Canada (NSERC)
- Canada Research Chair program from NSERC

Keywords

unsupervised learning · random projection · ADMM algorithm · fusion clustering

Taxonomy

Topics: Bayesian Methods and Mixture Models · Anomaly Detection Techniques and Applications · Advanced Clustering Algorithms Research

Full text

1. Introduction

Clustering is a pivotal technique in unsupervised learning, applied extensively across various scientific and technological fields that handle large datasets. Clustering also plays a crucial role in data labelling, which sets the stage for applying artificial intelligence and machine learning models [1,2] to the organized data for predictive analytics and classification tasks. Traditional clustering algorithms like k-means, Gaussian mixture models, and hierarchical clustering often face stability challenges due to their nonconvex optimization formulations, which can make results vary with factors such as initial conditions or data outliers [3,4,5]. Recent advancements in convex or concave fusion methods have shown promise in enhancing stability, achieving more consistent global or local optimality and reliable estimation of cluster centers and counts through sparsity-inducing penalties on pairwise centers [6,7,8,9]. For clustering high-dimensional data, the data can be mapped into a high-dimensional feature space (kernel space) for processing [10], or clustering can be achieved by optimizing a smooth and continuous objective function that is based on robust statistics [11]. This paper introduces a comprehensive empirical validation of these methods across simulation studies and real data analysis, detailing their improved stability over traditional methods and the practical implications of these advancements.

In fusion clustering, p-dimensional observations $[eqn]$ , $[eqn]$ are each parameterized by their own centroid $[eqn]$ . These centroids are estimated under the assumption that all observations can be grouped into K clusters $[eqn]$ , such that for $[eqn]$ , $[eqn]$ , where $[eqn]$ represents the cluster center for observations in cluster $[eqn]$ . Fusion clustering aims to concurrently estimate the cluster centroids $[eqn]$ and the partitions $[eqn]$ by minimizing the following objectives

[eqn]

The penalty function $[eqn]$ is used to control the complexity of the model, and it is determined by the tuning parameter $[eqn]$ . The form of the norm used is represented by $[eqn]$ . This penalty function is typically used in fusion clustering to encourage sparsity in the estimated cluster centroids.

Convex fusion clustering methods have been widely studied due to their computational simplicity and ability to find global optima. These methods often employ $[eqn]$ , $[eqn]$ , or $[eqn]$ penalties as the penalty function $[eqn]$ [12,13,14,15,16,17]. However, convex fusion can lead to biased estimates of the individual centroids, resulting in solutions with a large number of dense clusters [18,19]. To address this issue, researchers have proposed using concave fusion clustering methods, such as those using minimax concave penalties (MCPs) [20], truncated Lasso penalties (TLPs) [8], and arbitrary concave penalties.

While robust, convex and concave fusion clustering methods are computationally demanding with a $[eqn]$ complexity, which can limit their practicality in scenarios involving large sample sizes n and high-dimensional datasets p. This article proposes a strategy for overcoming this limitation using random projection techniques [21,22,23,24]. The approach involves the construction of a random diagonal matrix whose diagonal elements are sourced from a binary distribution. This matrix is then projected onto the pairwise component of the fusion method. By doing so, the number of pairwise differences between individual centroids, $[eqn]$ , is substantially reduced. This reduction not only decreases the computational load but also maintains the integrity of the clustering process, enhancing the algorithm’s scalability without excessively increasing the operational overhead. We provide empirical evidence demonstrating that this method significantly reduces the computational time while preserving the clustering quality, as shown in our simulation section.

In unsupervised learning, rapid clustering processes are crucial for handling large datasets efficiently. Our study introduces a novel approach to fusion clustering that enhances computational speed without compromising accuracy. Our contributions are summarized as follows: (1) We propose using random projection techniques to simplify the fusion aspect of clustering, reducing the number of pairwise centroid differences and significantly boosting computational efficiency by minimizing the fusion step’s complexity. (2) We develop a novel double recursive random projection ADMM method designed for efficient high-dimensional fusion clustering, improving the accuracy of clustering.

In the remainder of this paper, Section 2 describes the proposed new ADMM algorithm, analyzes its computational complexity and convergence, and presents a strategy for improving cluster accuracy. Section 3 evaluates the finite-sample properties of the proposed algorithm through simulation studies, and Section 4 demonstrates the method on a real data example. Section 5 presents concluding remarks, and technical proofs are provided in Appendix A and Appendix B.

2. Methodology

To improve convex or concave fusion clustering efficiency, we propose an extension of the classical ADMM algorithm based on a random projection called RP-ADMM. A random projection can significantly reduce the time and computational resources needed to analyze high-dimensional data, making it suitable for large datasets and real-time processing. In this section, we will discuss the RP-ADMM algorithm’s computational complexity and convergence.

2.1. Random Projection Based ADMM

Previous ADMM algorithms for convex or concave fusion clustering [6,8] have suffered from a high computational burden due to the need to consider all $[eqn]$ pairwise differences between individual centroids. This is represented by the fusion matrix $[eqn]$ , where $[eqn]$ is the ith unit vector with a 1 in the ith position and 0s elsewhere, and $[eqn]$ can be interpreted as the difference between the ith and jth individual centroids. The computational complexity of this approach is $[eqn]$ , which becomes infeasible for large sample sizes n.

Bernoulli distribution-based random projections ADMM

It is worth noting that pairwise differences between individual centroids can be deduced from other differences. For example, if we know that $[eqn]$ and $[eqn]$ , we can conclude that $[eqn]$ . This means that it may be unnecessary to consider the row $[eqn]$ in $[eqn]$ . To reduce the computational burden of convex or concave fusion clustering, we propose a random projection approach that considers only a small subset of the $[eqn]$ pairwise differences between individual centroids. This is achieved by generating indicators $[eqn]$ from a Bernoulli distribution with probability $[eqn]$ . We then form a random matrix $[eqn]$ , which is a diagonal matrix with diagonal elements $[eqn]$ . If $[eqn]$ , the difference between $[eqn]$ and $[eqn]$ is taken into account; if $[eqn]$ , it is not considered. The probability $[eqn]$ controls the size of the subset of pairwise differences considered. The matrix $[eqn]$ can be seen as a Bernoulli-distribution-based projection of $[eqn]$ onto a sparse matrix in which about $[eqn]$ rows are zero vectors and about $[eqn]$ rows are nonzero vectors. Finally, we form a new fusion matrix $[eqn]$ by deleting the rows of zero vectors in $[eqn]$ . The new fusion matrix is given by $[eqn]$ , where $[eqn]$ $[eqn]$ denotes the jth row vector of $[eqn]$ .
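The Bernoulli selection step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `sample_fusion_pairs` and `fusion_matrix` and the argument `prob` are hypothetical, and the value `prob=0.1` is an arbitrary choice because the paper's own expression for the probability is not reproduced in this text.

```python
import numpy as np

def sample_fusion_pairs(n, prob, seed=None):
    """Keep each pair (i, j), i < j, independently with probability `prob`."""
    rng = np.random.default_rng(seed)
    i_idx, j_idx = np.triu_indices(n, k=1)   # all n(n-1)/2 candidate pairs
    keep = rng.random(i_idx.size) < prob     # Bernoulli(prob) indicators
    return i_idx[keep], j_idx[keep]

def fusion_matrix(n, i_idx, j_idx):
    """Reduced fusion matrix: one row (e_i - e_j)^T per retained pair.

    Rows for discarded pairs are never built, which is equivalent to
    deleting the zero rows after multiplying by the diagonal 0/1 matrix.
    """
    m = i_idx.size
    delta = np.zeros((m, n))
    rows = np.arange(m)
    delta[rows, i_idx] = 1.0
    delta[rows, j_idx] = -1.0
    return delta

# Assumed toy sizes: 60 observations, keep roughly 10% of the pairs.
n = 60
i_idx, j_idx = sample_fusion_pairs(n, prob=0.1, seed=0)
delta = fusion_matrix(n, i_idx, j_idx)
print(delta.shape[0], "of", n * (n - 1) // 2, "pairs retained")
```

One caveat of this sketch: if `prob` is too small, the retained pairs may no longer link every observation to the others in its cluster, so the transitivity argument above stops applying to those observations.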

We just consider $[eqn]$ in (1) for simplicity and propose a random projection-based fusion criterion by

[eqn]

where $[eqn]$ . Furthermore, the objective function in (2) is equivalent to

[eqn]

where $[eqn]$ , $[eqn]$ . Under the constraints in (3), the augmented Lagrangian $[eqn]$ has the form

[eqn]

where the dual variables $[eqn]$ are Lagrange multipliers, and $[eqn]$ is a tuning parameter. Under the iterative value $[eqn]$ and $[eqn]$ at the mth step, we conduct the Bernoulli distribution-based random projection ADMM (RP-ADMM) iterative algorithm and compute the estimates of $[eqn]$ as follows:

[eqn]

[eqn]

[eqn]

where $[eqn]$ equals

[eqn]

and $[eqn]$ equals

[eqn]

Ma and Huang (2017) [18] have argued that under (8), the element $[eqn]$ of $[eqn]$ is the minimizer of $[eqn]$ , where $[eqn]$ . For different thresholding operators $[eqn]$ , the estimate $[eqn]$ takes different forms. For example:

For the Lasso penalty [25],

[eqn]

For SCAD penalty [26] with $[eqn]$ ,

[eqn]

For the MCP [27] with $[eqn]$ ,

[eqn]

For the TLP [8] with $[eqn]$ ,

[eqn]
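As an illustration of how such thresholding updates look in code, the sketch below implements the Lasso and MCP cases in the group soft-thresholding form commonly given in the concave fusion literature. The closed forms used here are assumptions, since the paper's own expressions are not reproduced in this text. The names `lam`, `theta`, and `gamma` stand for the tuning parameter, the ADMM penalty parameter, and the MCP concavity parameter, and `z` stands for one difference vector being thresholded.

```python
import numpy as np

def group_soft_threshold(z, t):
    """S(z, t) = (1 - t / ||z||)_+ * z for a single difference vector z."""
    norm = np.linalg.norm(z)
    if norm == 0.0:
        return np.zeros_like(z)
    return max(0.0, 1.0 - t / norm) * z

def lasso_update(z, lam, theta):
    """Lasso case: plain group soft-thresholding at level lam / theta."""
    return group_soft_threshold(z, lam / theta)

def mcp_update(z, lam, theta, gamma):
    """MCP case (assumes gamma > 1 / theta).

    Small differences are shrunk and rescaled; differences with norm
    above gamma * lam are left untouched, which avoids the bias that
    the Lasso introduces for well-separated centroids.
    """
    if np.linalg.norm(z) <= gamma * lam:
        return group_soft_threshold(z, lam / theta) / (1.0 - 1.0 / (gamma * theta))
    return z

z = np.array([3.0, 4.0])                               # ||z|| = 5
print(lasso_update(z, lam=1.0, theta=1.0))             # shrunk towards zero
print(mcp_update(z, lam=1.0, theta=1.0, gamma=3.0))    # 5 > 3, returned unchanged
```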

Through some algebra, the problem of (9) is equivalent to the minimization of the function $[eqn]$ , which has the form

[eqn]

Under the given value of $[eqn]$ , $[eqn]$ , the updated $[eqn]$ are

[eqn]

where $[eqn]$ is the $[eqn]$ identity matrix. $[eqn]$ and $[eqn]$ are updated according to the random projection ADMM iterative algorithm (5)–(7) until a convergence criterion is met; in our practice, we stop when both the dual and primal residuals are close to zero [28]. The convergence time of ADMM is highly related to the penalty parameter $[eqn]$ . A poor selection of $[eqn]$ can result in slow convergence for the ADMM algorithm [29] and thus for RP-ADMM. In this paper, we fix $[eqn]$ throughout for simplicity.

To facilitate the updates of $[eqn]$ at the $[eqn]$ th step in (5) to (7) of the RP-ADMM iterative algorithm, we need to specify a proper initial value (warm start). Here, we set $[eqn]$ , $[eqn]$ and obtain the initial estimators $[eqn]$ as the minimizer of a ridge fusion criterion

[eqn]

We summarize the above analysis in Algorithm 1.

Algorithm 1 RP-ADMM for fusion clustering
Input: data $[eqn]$ ; Initialize $[eqn]$ , $[eqn]$ ; tuning parameter, $[eqn]$
Output: an estimate of $[eqn]$
for $[eqn]$ do
    compute $[eqn]$ using (5)
    compute $[eqn]$ using (6)
    compute $[eqn]$ using (7)
    if convergence criterion is met, then
        Stop and denote the last iteration by $[eqn]$ ,
    else
        $[eqn]$
    end if
end for
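To make the loop in Algorithm 1 concrete, here is a minimal NumPy sketch of a random-projection ADMM iteration with an MCP penalty. It is an illustration under several assumptions rather than the authors' code: the updates follow the standard ADMM form for fusion clustering with a squared-error loss, the warm start simply uses the data instead of the ridge fusion estimator, and the names `rp_admm`, `mcp_rows`, `prob`, `lam`, `theta`, and `gamma`, along with all default values, are hypothetical.

```python
import numpy as np

def mcp_rows(z, lam, theta, gamma):
    """Row-wise MCP thresholding (one row per retained pair); needs gamma > 1/theta."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    safe = np.maximum(norms, 1e-12)  # avoid division by zero
    shrunk = np.maximum(0.0, 1.0 - (lam / theta) / safe) * z
    shrunk = shrunk / (1.0 - 1.0 / (gamma * theta))
    return np.where(norms <= gamma * lam, shrunk, z)

def rp_admm(x, prob, lam, theta=1.0, gamma=3.0, max_iter=500, tol=1e-6, seed=0):
    n = x.shape[0]
    rng = np.random.default_rng(seed)
    # Bernoulli selection of pairwise differences.
    i_idx, j_idx = np.triu_indices(n, k=1)
    keep = rng.random(i_idx.size) < prob
    i_idx, j_idx = i_idx[keep], j_idx[keep]
    m = i_idx.size
    delta = np.zeros((m, n))
    delta[np.arange(m), i_idx] = 1.0
    delta[np.arange(m), j_idx] = -1.0

    # Factor I + theta * Delta^T Delta once; it does not change across iterations.
    chol = np.linalg.cholesky(np.eye(n) + theta * delta.T @ delta)

    mu = x.copy()              # assumed warm start: the data themselves
    eta = delta @ mu           # pairwise differences
    ups = np.zeros_like(eta)   # dual variables
    for _ in range(max_iter):
        # Centroid update: solve (I + theta Delta^T Delta) mu = rhs.
        # np.linalg.solve is a general solver; a triangular solver would be faster.
        rhs = x + delta.T @ (theta * eta - ups)
        mu = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))
        diff = delta @ mu
        # Difference update via thresholding, then dual ascent step.
        eta_new = mcp_rows(diff + ups / theta, lam, theta, gamma)
        ups = ups + theta * (diff - eta_new)
        primal = np.linalg.norm(diff - eta_new)
        dual = theta * np.linalg.norm(delta.T @ (eta_new - eta))
        eta = eta_new
        if primal < tol and dual < tol:
            break
    return mu, eta, (i_idx, j_idx)

# Toy data: two tight, well-separated groups of 10 points in 2 dimensions.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(10.0, 0.1, (10, 2))])
mu, eta, pairs = rp_admm(x, prob=0.7, lam=0.5)
print(np.round(mu[[0, 9, 10, 19]], 2))
```

On this toy data the estimated centroids within each group should collapse to a common value, while the two groups stay far apart because the MCP leaves large differences unpenalized.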

In practice, to save computing time, we do not run the RP-ADMM updates all the way to convergence in the first iterations. Another trick is to initialize each subsequent convex relaxation at the optimal values of the previous relaxed convex problem, which significantly reduces the number of RP-ADMM iterations.

2.2. Selection of Optimal Tuning Parameter

For a given $[eqn]$ , the converging value $[eqn]$ of the above RP-ADMM procedure is defined as

[eqn]

where $[eqn]$ is defined in (2) and the optimal value of $[eqn]$ can be selected via a properly constructed data-driven criterion. In particular, we partition the support of $[eqn]$ into a grid of $[eqn]$ , and for each $[eqn]$ , we compute a solution path of $[eqn]$ and obtain $[eqn]$ distinct cluster centroids $[eqn]$ . The optimal $[eqn]$ is selected by minimizing a data-driven BIC, i.e., $[eqn]$ , where

[eqn]

Subsequently, we obtain the estimator $[eqn]$ , and the individuals can be separated into $[eqn]$ clusters accordingly, i.e., $[eqn]$ , $[eqn]$ .

Other methods for tuning parameters in clustering, such as generalized degrees of freedom with generalized cross-validation [8] and stability-based cross-validation [25,30], can provide good results but may require extensive computation or the specification of a hyperparameter perturbation size [8]. In contrast, the proposed BIC is easy to compute and performs well in estimating cluster centroids and the true number of clusters (K). Figure 1 shows the change in BIC values against $[eqn]$ and the cluster number of the simulation. Across all cases with different values of n and p, we observe that BIC( $[eqn]$ ) first decreases as the value of $[eqn]$ increases. Once the true cluster number $[eqn]$ is recovered, BIC( $[eqn]$ ) reaches its minimum at the optimal $[eqn]$ . As $[eqn]$ increases further, the cluster centroids continue to merge and BIC( $[eqn]$ ) grows. However, further research is needed to fully prove the consistency of the BIC in combination with the objective function (2).

2.3. Recursive RP-ADMM and Cluster Matrix

In the above cluster analysis, the effect of randomness on the clustering results was not considered. Empirical analysis has shown that the impact of this randomness on the estimated cluster centers and numbers (i.e., $[eqn]$ ’s and $[eqn]$ ’s) is minimal. However, the impact on the final partitioning results (i.e., which observations are grouped into a single cluster) can be significant. In response to this, we propose the Recursive RP-ADMM (RRP-ADMM) procedure, which performs multiple RP-ADMM cluster analyses by generating M random matrices (i.e., $[eqn]$ ’s, $[eqn]$ ) and repeatedly conducting the analysis.

Once the multiple RP-ADMM cluster analyses have been completed, we must summarize the results. We define a $[eqn]$ symmetric cluster matrix $[eqn]$ where $[eqn]$ denotes that the ith and jth observations belong to the same cluster; otherwise, $[eqn]$ . Another $[eqn]$ symmetric matrix $[eqn]$ is introduced, with element $[eqn]$ representing the relative frequency of the ith and jth observations belonging to the same cluster over the M independent RP-ADMM clustering procedures. The decision of whether the ith and jth observations should be grouped into a single cluster or not can then be treated as a classification problem, with the two possible class labels being 1 (belong to the same cluster) or 0 (do not belong to the same cluster). We can use an indicator function to transform the relative frequency into class labels and generate an estimator for the cluster matrix $[eqn]$ , i.e.,

[eqn]

where $[eqn]$ denotes the indicator function. We summarize the above procedure in Algorithm 2. This transformation can be understood as a voting-based aggregation strategy, similar to the one proposed by [31], which aims to reduce misclassification errors and improve the accuracy of the clustering. To evaluate the accuracy of the clustering results, we define a new measure called the similarity index (SI) between two data clusterings:

[eqn]

Like the Rand Index (RI) measure [32], the newly introduced evaluation criterion can be seen as a measure of the percentage of correct decisions made by some algorithm. The SI values also range from 0 to 1, with lower values indicating better algorithm performance.

Algorithm 2 RRP-ADMM for fusion clustering
Input: data $[eqn]$ ; M; Initialize $[eqn]$ , $[eqn]$ ; tuning parameter, $[eqn]$
Output: an estimate of $[eqn]$
for $[eqn]$ , M do
    compute $[eqn]$ using RP-ADMM
end for
while $[eqn]$ do
    compute $[eqn]$ and $[eqn]$ from (13)
end while
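The voting step of Algorithm 2 can be sketched as follows: each RP-ADMM run yields a set of cluster labels, the co-membership matrices of the runs are averaged into relative frequencies, and an indicator turns the frequencies into a 0/1 cluster matrix. This is a hedged illustration: the function names are hypothetical, and the cutoff `threshold=0.5` with a strict inequality is an assumption, because the paper's exact indicator in (13) is not reproduced in this text.

```python
import numpy as np

def comembership(labels):
    """n x n 0/1 matrix: entry (i, j) is 1 when i and j share a label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def vote_cluster_matrix(label_runs, threshold=0.5):
    """Average co-membership over runs, then threshold (assumed cutoff)."""
    freq = np.mean([comembership(labels) for labels in label_runs], axis=0)
    return freq, (freq > threshold).astype(int)

# Three hypothetical runs on four observations. Label names differ across
# runs (run 3 swaps 0 and 1), which co-membership matrices are immune to.
runs = [[0, 0, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0]]
freq, c_hat = vote_cluster_matrix(runs)
print(freq)
print(c_hat)
```

Working with co-membership matrices rather than raw labels avoids having to match cluster labels across runs before voting.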

The classical convex or concave fusion clustering procedure in (1) requires $[eqn]$ operations and $[eqn]$ of storage for a single round of ADMM updates with primal and dual residual calculations, because all pairs of centroids are shrunk together in this method.

The RP-ADMM algorithm significantly improves computational efficiency compared to the classical ADMM algorithm. It requires only $[eqn]$ of storage, compared to $[eqn]$ for the classical ADMM algorithm, because the variables $[eqn]$ and $[eqn]$ have only $[eqn]$ columns rather than $[eqn]$ . Additionally, the RP-ADMM algorithm requires only $[eqn]$ operations for its most computationally demanding step, compared to $[eqn]$ for the classical ADMM algorithm. The Cholesky factorization costs $[eqn]$ operations for RP-ADMM, compared to $[eqn]$ for the classical ADMM algorithm; it is computed only once and reused across repeated RP-ADMM updates.

At the end of this subsection, we will demonstrate the convergence of the RP-ADMM algorithm by showing that the sequence generated by the algorithm contains a subsequence that converges to a stationary point.

Lemma 1. Let $[eqn]$ be the sequence generated by Algorithm 1, then for some constant $[eqn]$ ,

[eqn]

In order to prove that the sequence $[eqn]$ is convergent, we need to assume that $[eqn]$ is bounded and $[eqn]$ , conditions that are often observed in numerical tests.

Theorem 1. If $[eqn]$ are bounded and $[eqn]$ , then $[eqn]$ is bounded. Moreover, there exists a subsequence $[eqn]$ , such that

[eqn]

and thus, $[eqn]$ has a subsequence which converges to the stationary point.

3. Simulation

In this part of the study, we conducted simulation experiments to compare the extended and classical ADMM clustering algorithms in terms of computational time and clustering accuracy, using the evaluation criterion in (14). The Lasso-based fusion method often leads to the formation of dense clusters with a minor penalty for small differences in $[eqn]$ , which can result in many spurious clusters with very small differences among them [6]. In contrast, the concave penalty method tends to produce a clear cluster structure and a well-defined number of clusters [8]. Therefore, in this study, we focus on the MCP-based fusion method [27] and compare the clustering performance of the conventional ADMM with that of the proposed new ADMM algorithm.

3.1. Low-Dimensional Setting

In this part, we evaluated the clustering performance of the classical ADMM, RP-ADMM, and RRP-ADMM algorithms on low-dimensional synthetic data generated from three overlapping convex clusters with the same spherical shape, for a given dimension p and sample size n. The synthetic data were generated from three populations $[eqn]$ , $[eqn]$ with $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ and $[eqn]$ with $[eqn]$ and $[eqn]$ for $[eqn]$ . This setting was chosen deliberately so that samples from neighboring clusters overlap, thereby increasing the difficulty of the clustering task. As illustrated in Figure 2c, the clustering performance using a single random projection (RP-ADMM) was suboptimal, indicating challenges with cluster separability under this setup. Conversely, Figure 2b demonstrates that recursive random projection (RRP-ADMM) significantly improved clustering results. The number of recursions for the RP-ADMM and RRP-ADMM algorithms was set to $[eqn]$ .

To evaluate the accuracy of the RP-ADMM, relax-and-split approach [33] (RS-ADMM) and RRP-ADMM algorithms in recovering the true cluster matrix, we generated a random sample of $[eqn]$ observations with 1–20 drawn from $[eqn]$ , 21–40 drawn from $[eqn]$ , and 41–60 drawn from $[eqn]$ , and set the number of dimensions to $[eqn]$ . The probability $[eqn]$ of generating a 1 in the random matrix was set to $[eqn]$ , where c controls the probability size. The level plots in Figure 2 use colour to visualize the values of 1’s and 0’s in the cluster matrix. The results show that both RP-ADMM and RRP-ADMM can accurately recover the true cluster matrix, with RRP-ADMM showing more accurate gradation than the true cluster matrix. Single random projection (RP-ADMM) can cause high variance in clustering outcomes due to the randomness of the sampling process. To mitigate this issue, we have adopted the voting-based pooling technique [31], which reduces variance by averaging results from recursive random projection (RRP-ADMM).

To further evaluate the performance of the algorithms, we calculated the values of the index SI defined in (14) after 100 replicates under different c choices. We depicted the results as boxplots in Figure 3. These results show that RRP-ADMM consistently improves clustering accuracy compared to RP-ADMM, as evidenced by the smaller median and standard error of SI values.

Next, we compare the performance of classical ADMM and RRP-ADMM in terms of computation time per iteration and the SI after 100 trials. The sample size is varied with $[eqn]$ points and $[eqn]$ , while $[eqn]$ is kept constant. We limit the number of points to 360 because the classical ADMM algorithm requires a significant amount of computation time for a single realization with more points. We also compare the Similarity Index (SI) and the Rand Index (RI) as measures of the clustering results. Computing the RI requires the partitioning structure of all points, which we derive from the estimated cluster matrix graph. This process involves first identifying the point $[eqn]$ with the most neighbors and aggregating the points connected to $[eqn]$ as cluster 1, then finding the second point $[eqn]$ with the most edges to form cluster 2, and repeating this process until no points remain.
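The greedy conversion from an estimated cluster matrix to a partition can be sketched as below. This is one possible reading of the procedure, with two assumptions stated plainly: "connected points" is taken to mean direct neighbors of the chosen point, and neighbor counts are recomputed among the points that are still unassigned. The function name `greedy_partition` is hypothetical.

```python
import numpy as np

def greedy_partition(c_hat):
    """Turn a symmetric 0/1 cluster matrix into cluster labels.

    Repeatedly pick the unassigned point with the most unassigned
    neighbors, group it with those neighbors, and remove them.
    """
    c = np.array(c_hat, dtype=bool)      # copy, so the input is not modified
    np.fill_diagonal(c, False)           # ignore self-links
    n = c.shape[0]
    labels = np.full(n, -1)
    remaining = np.ones(n, dtype=bool)
    k = 0
    while remaining.any():
        degree = (c & remaining[None, :]).sum(axis=1)
        degree[~remaining] = -1          # never pick an assigned point
        hub = int(np.argmax(degree))
        members = c[hub] & remaining
        members[hub] = True
        labels[members] = k
        remaining &= ~members
        k += 1
    return labels

# Hypothetical estimated cluster matrix on six points:
# 0 links to 1 and 2 (1 and 2 are not linked), 3 links to 4, 5 is isolated.
c_hat = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (3, 4)]:
    c_hat[i, j] = c_hat[j, i] = 1
print(greedy_partition(c_hat))
```

Points 1 and 2 end up in the same cluster even though the estimated matrix does not link them directly, because both are neighbors of the hub point 0.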

Table 1 shows the mean values of the SI, RI, and the consumed time in seconds for different sample sizes under different methods after 100 replicates. Based on the data in Table 1, we can observe the following: (i) The proposed RRP-ADMM significantly reduces the time required for convex or concave fusion clustering, especially when the sample size increases. (ii) RRP-ADMM produces smaller SI and larger RI values, possibly due to the voting-based pooling technique improving cluster accuracy. (iii) As the sample size increases, the SI and RI values decrease. The boxplots in Figure 4 and Figure 5 demonstrate the superiority of the RRP-ADMM algorithm over the classical ADMM algorithm in terms of both the SI values and the square root of run time, as seen in the results obtained from 100 replicates with four different sample sizes. These results further reinforce our belief in the effectiveness of the RRP-ADMM algorithm.

3.2. High-Dimensional Setting

In this part, we investigate using the double random projection-based alternating direction method of multipliers (DRP-ADMM and DRRP-ADMM) algorithms for clustering high-dimensional data sets. We employ a recursive Gaussian distribution-based random projection strategy in the first step to mitigate the impact of randomness on cluster results. Since the classical ADMM algorithm is computationally intensive in high-dimensional settings, we focus on evaluating the performance of the DRP-ADMM and DRRP-ADMM algorithms with recursion number $[eqn]$ , using three Gaussian random projections in the outer layer and three binary random projections in the inner layer. The simulated data sets consist of two overlapping convex clusters with the same spherical shape. They are generated using a population $[eqn]$ , $[eqn]$ with $[eqn]$ , $[eqn]$ . Furthermore, $[eqn]$ with $[eqn]$ and $[eqn]$ for $[eqn]$ . We consider four high-dimensional cases with $[eqn]$ and a fixed sample size of $[eqn]$ .

We evaluate the accuracy of the DRP-ADMM and DRRP-ADMM algorithms in recovering the true cluster matrix. To do this, we first generate a Gaussian random matrix $[eqn]$ with dimensions $[eqn]$ in the first projection. The elements of $[eqn]$ correspond to $[eqn]$ . We set $[eqn]$ with $[eqn]$ and $[eqn]$ . See [21,23] for the number of projections. In the second step, we generate a diagonal binary random matrix with probability $[eqn]$ of equaling one. Then, we calculate the values of the SI index defined in Equation (14) and plot the results as boxplots in Figure 6 after 100 replicates for different values of p. The results show that the DRRP-ADMM algorithm consistently outperforms the DRP-ADMM algorithm regarding the median and standard error of the SI values for all values of p, indicating that the DRRP-ADMM algorithm improves clustering accuracy.
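The outer (first) projection can be illustrated with a short sketch: the p-dimensional observations are multiplied by a Gaussian random matrix to obtain d-dimensional data, on which the Bernoulli-based pairwise selection would then operate. The entry distribution N(0, 1/d), the target dimension `d=200`, and the function name `gaussian_project` are assumptions for illustration, since the paper's own choices are not reproduced in this text.

```python
import numpy as np

def gaussian_project(x, d, seed=None):
    """Project the rows of x from p to d dimensions with a Gaussian matrix.

    Entries are drawn from N(0, 1/d) (an assumed scaling) so that squared
    distances between rows are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    p = x.shape[1]
    g = rng.normal(0.0, 1.0 / np.sqrt(d), size=(p, d))
    return x @ g

# Assumed toy sizes: 40 observations in 2000 dimensions, reduced to 200.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 2000))
z = gaussian_project(x, d=200, seed=1)

def pairwise_sq_dists(a):
    return np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)

iu = np.triu_indices(x.shape[0], k=1)
ratio = pairwise_sq_dists(z)[iu] / pairwise_sq_dists(x)[iu]
print(z.shape, float(ratio.mean()))
```

Because pairwise distances are roughly preserved, the fusion penalty on the projected centroids acts on approximately the same geometry as in the original space, at a much lower per-iteration cost in p.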

4. Real Data Analysis

In this study, we use the DrivFace dataset to demonstrate the effectiveness of our proposed clustering procedure. The DrivFace database consists of $[eqn]$ images of 640 × 480 pixels each, captured from four drivers (two women and two men) over different days and containing $[eqn]$ facial features such as glasses and beards. Each driver’s images containing similar facial features can be grouped into one cluster, resulting in a total of $[eqn]$ clusters as shown in Figure 7a. This dataset is well suited to evaluating our proposed clustering method for two reasons. First, we know the true labels of the dataset; that is, there are four clusters, and we also know which observations belong to the same cluster. Second, because the similarity among observations in the pictures is very high across different clusters, it is challenging to separate them.

Due to the large sample size of the DrivFace dataset, we do not use the classical ADMM algorithm, which would require $[eqn]$ operations in a single ADMM iteration. Instead, we first scale the samples by each feature and apply the RP-ADMM procedure to estimate individual centers using a grid of $[eqn]$ values. We plot the $[eqn]$ of four selected variables in Figure 8, and inspection of Figure 8a suggests that some outlying points (influential points) cause the clusters to be dense. We then remove these 55 points and plot a new $[eqn]$ in Figure 8b. The optimal $[eqn]$ value, as determined by the developed BIC criterion in Equation (12), is $[eqn]$ , indicating that the estimated number of clusters is four, the same as the number of drivers. We apply the proposed RRP-ADMM algorithm with a Bernoulli-distribution-based random projection procedure to further improve the cluster accuracy using $[eqn]$ and a recursive number $[eqn]$ . Using the estimated optimal tuning parameter of $[eqn]$ , we obtain the estimated cluster matrix in Figure 7b, which closely resembles the true cluster matrix in Figure 7a. The calculated similarity index (SI) value is $[eqn]$ . Moreover, the value of the Adjusted Rand Index (ARI) is 0.672.

5. Conclusions

We propose using the recursive random projection-based ADMM (RRP-ADMM) method to improve the speed and accuracy of convex and nonconvex fusion clustering. In simulations and real data examples, the RRP-ADMM method demonstrates superior performance in fast calculation and accurate clustering results. The RRP-ADMM algorithm is scalable and can be applied to deal with heterogeneous issues in any setting that involves fusion techniques.

However, some challenges still need to be addressed in this field. One challenge is efficiently transforming the cluster matrix graph into the target partitioning structure and determining the optimal number of clusters. Another challenge is using prior information about which points are more likely to be integrated into a single cluster to reduce the number of pairwise comparisons. Additionally, further study is needed to determine the theoretically optimal probability of drawing a one in the binary random projection. Another future research direction involves performing clustering simultaneously with feature selection, using techniques such as incorporating feature weights [34] or introducing sparsity [14].

Bibliography

1. Haq M.A. CDLSTM: A novel model for climate change forecasting. Comput. Mater. Contin. 2022, 71, 2. doi:10.32604/cmc.2022.023059
2. Haq M.A. SMOTEDNN: A novel model for air pollution forecasting and AQI classification. Comput. Mater. Contin. 2022, 71, 1. doi:10.32604/cmc.2022.021968
3. Van Der Kloot W.A., Spaans A.M.J., Heiser W.J. Instability of hierarchical cluster analysis due to input order of the data: The Permu CLUSTER solution. Psychol. Methods 2005, 10, 468. doi:10.1037/1082-989X.10.4.468. PMID 16393000
4. Xu R., Wunsch D. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. doi:10.1109/TNN.2005.845141. PMID 15940994
5. Yang X., Yan X., Huang J. High-dimensional integrative analysis with homogeneity and sparsity recovery. J. Multivar. Anal. 2019, 174, 104529. doi:10.1016/j.jmva.2019.06.007
6. Chi E.C., Lange K. Splitting methods for convex clustering. J. Comput. Graph. Stat. 2015, 24, 994–1013. doi:10.1080/10618600.2014.948181. PMID 27087770; PMC 4830509
7. Lindsten F., Ohlsson H., Ljung L. Clustering using sum-of-norms regularization: With application to particle filter output computation. Proceedings of the 2011 IEEE Statistical Signal Processing Workshop (SSP), Nice, France, 28–30 June 2011, 201–204. doi:10.1109/SSP.2011.5967659
8. Pan W., Shen X., Liu B. Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty. J. Mach. Learn. Res. 2013, 14, 1865. PMID 24358018; PMC 3866036