Hybrid Density- and Partition-based Clustering Algorithm for Data with Mixed-type Variables
Shu Wang, Jonathan G. Yabes, Chung-Chou H. Chang

Shu Wang
Department of Biostatistics, College of Public Health and Health Professions, University of Florida
University of Florida Health Cancer Center
Jonathan G. Yabes
Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh
Department of Medicine, School of Medicine, University of Pittsburgh
Department of Clinical and Translational Science, School of Medicine, University of Pittsburgh
Chung-Chou H. Chang
Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh
Department of Medicine, School of Medicine, University of Pittsburgh
Department of Clinical and Translational Science, School of Medicine, University of Pittsburgh
Clustering is an essential technique for discovering patterns in data. The steady increase in the amount and complexity of data over the years has led to improvements in existing clustering algorithms and the development of new ones. However, algorithms that can cluster data with mixed variable types (continuous and categorical) remain limited, despite the abundance of mixed-type data, particularly in the medical field. Among existing methods for mixed data, some rely on unverifiable distributional assumptions, while others do not balance the contributions of different variable types well.
We propose a two-step hybrid density- and partition-based algorithm (HyDaP) that can detect clusters after variable selection. The first step uses both density-based and partition-based algorithms to identify the data structure formed by continuous variables and to recognize the important variables for clustering; the second step uses a partition-based algorithm together with a novel dissimilarity measure we designed for mixed data to obtain clustering results. Simulations across various scenarios and data structures were conducted to examine the performance of the HyDaP algorithm compared to commonly used methods. We also applied the HyDaP algorithm to electronic health records to identify sepsis phenotypes.
KEY WORDS: Clustering, Mixed data, Variable selection
1 Introduction
In precision medicine, prevention and treatment strategies are tailored to individual characteristics. This practice has been greatly improved by information obtained from large databases (Council et al., 2011), including electronic health records (EHRs), which contain patient information such as demographics, daily charts, medical history, lab results, medication use, and billing information (Häyrinen et al., 2008). To process data efficiently and extract useful information, machine learning methods are often applied (Coorevits et al., 2013). Clustering is an important class of unsupervised machine learning methods; applied to EHR data, it aims to uncover hidden patient subgroups that may have different diagnoses and treatment responses. Further investigation of these subgroups, together with current clinical guidelines, could help design precision medicine strategies that assist physicians in providing better patient care (Jensen et al., 2012).
The basic concept of clustering is to divide individuals into a number of subgroups such that individuals within the same subgroup have more similar characteristics, as defined by a set of variables, than individuals who belong to different subgroups. One of the main challenges in clustering is how to define “dissimilarity” between subjects with data of mixed variable types (continuous and categorical). If all variables are continuous, we can view the collection of information from an individual as a data point, or a vector of variables, in a high-dimensional covariate space. The distance between the data points of two individuals is used to determine the dissimilarity between these two subjects, so that a smaller distance indicates lower dissimilarity. If all variables are categorical, dissimilarity measures (or similarity measures) have been proposed that evaluate how often two individuals fall in the same category across those variables. In this context we will use “distance” and “dissimilarity” interchangeably. Gower distance (Gower, 1971), the distance defined in factorial analysis of mixed data (FAMD) (Pagès, 2014), and K-prototypes (Huang, 1998) are possible methods for handling mixed variable types.
Gower distance was proposed to measure dissimilarity between subjects with mixed types of variables. The distance measure used in FAMD can be applied to mixed data as well, even though FAMD was not originally intended for clustering. The distance measure defined in K-prototypes is similar to Gower distance, but it incorporates a user-defined weight for each variable type. Consequently, K-prototypes assumes that all categorical variables have the same weight, and that all continuous variables have the same weight. This design may not be practical if, within the same variable type, some variables are clinically more important than others for clustering.
Finite mixture model (FMM) (McCutcheon, 1987; Moustaki, 1996) is a model-based clustering method that bypasses the challenge of defining dissimilarity between subjects with mixed types of variables. It assumes that the data are a mixture of several parametric distributions. The unknown distributional parameters, including cluster membership, can be estimated by maximizing the likelihood using the expectation-maximization (EM) algorithm. Moreover, it turns the task of selecting the optimal number of clusters into a model selection problem, which is much more straightforward. However, its main drawback is that all the distributional assumptions are conditional on the unknown cluster, making those assumptions unverifiable from the data.
In order to identify cluster memberships, it is also important to know the underlying data structure. For example, whether distinct clusters exist in the feature space; or if no natural clusters exist but the data is heterogeneous enough to be partitioned. Such information is crucial in understanding data, selecting clustering methods, and interpreting clustering results. However, to our knowledge, none of the existing methods incorporates this data structure information into clustering.
To address the limitations of the existing methods, we propose a Hybrid Density- and Partition-based (HyDaP) algorithm to identify clusters for data with mixed types of variables. We use this method to discover sepsis phenotypes from demographic and clinical EHR data for sepsis patients at university-affiliated hospitals.
In Section 2 we introduce the most commonly used dissimilarity measures and clustering algorithms; in Section 3 we define three data structures and propose a new clustering algorithm, HyDaP; in Section 4 we present performance comparisons among different methods under various simulation settings; in Section 5 we demonstrate the use of HyDaP algorithm to identify sepsis phenotypes; and Section 6 is discussion.
2 Review of dissimilarity measures and clustering algorithms
In this section, we briefly review some existing dissimilarity measures and clustering algorithms. In addition, we discuss the pros and cons of each measure or algorithm.
2.1 Dissimilarity measures
Minkowski distance is a family of dissimilarity measures for numeric variables. Let $x_i = (x_{i1}, \dots, x_{ip})$ be a vector representing the $p$ variables of subject $i$. For subjects $i$ and $j$, the Minkowski distance between the two is defined as follows:
$$d(x_i, x_j) = \left( \sum_{l=1}^{p} |x_{il} - x_{jl}|^{q} \right)^{1/q},$$
where $q$ is related to the shape of the unit circle, which is a two-dimensional contour with every point on the contour at a distance of 1 from the center. Different choices of $q$ lead to different distance measures. For example, $q = 2$ leads to the famous Euclidean distance, which is intuitive and able to represent physical distances. When $q = 1$, we obtain Manhattan distance, which is often used to detect hyperrectangular clusters. When $q \to \infty$, we obtain Chebyshev (maximum) distance, which is the same as chessboard distance since it is defined as the greatest of the differences among all dimensions. A potential problem with the Minkowski distance is that variables with larger variances tend to dominate the others (Xu and Wunsch, 2005; Shirkhorshidi et al., 2015); it is therefore recommended to standardize variables (that is, rescale each variable by dividing by its standard deviation) before applying this measure.
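As a small illustration of how the choice of exponent changes the measure, the following is a minimal sketch in plain Python. The function name `minkowski` and the toy vectors are assumptions made for this example only.

```python
import math

def minkowski(x, y, q):
    """Minkowski distance between two equal-length numeric vectors.

    q = 1 gives Manhattan, q = 2 gives Euclidean, and q = math.inf
    gives the Chebyshev (maximum) distance.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(q):
        return max(diffs)  # limiting case: largest coordinate difference
    return sum(d ** q for d in diffs) ** (1.0 / q)

# Toy example (hypothetical values): two subjects measured on two variables.
x, y = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(x, y, 1)
euclidean = minkowski(x, y, 2)
chebyshev = minkowski(x, y, math.inf)
```

For these two points the three distances are 7, 5, and 4, which shows that the ranking of subject pairs can depend on the exponent chosen.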
Other dissimilarity measures for numeric variables include the cosine similarity measure, Pearson correlation, and Mahalanobis distance, to name a few. Cosine similarity measures the angle between two vectors regardless of vector magnitudes. It is usually applied when magnitudes are not of interest, for example in text mining, where it captures text meanings instead of counting numbers (Xu and Wunsch, 2005; Han et al., 2011). Pearson correlation is usually used in clustering gene expression data (Xu and Wunsch, 2005), but it is sensitive to outliers. Mahalanobis distance is scale-invariant and takes into account variable correlations.
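When variables are all categorical, simple matching dissimilarity is usually used: $d(x_i, x_j) = \sum_{l=1}^{p} \delta(x_{il}, x_{jl})$, where $\delta(x_{il}, x_{jl})$ equals 0 if the values of variable $l$ are the same for individuals $i$ and $j$ and 1 otherwise.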
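None of the above-mentioned dissimilarity measures can be applied to mixed data. Gower distance was proposed to calculate the distance between subjects with mixed types of variables. Let $X$ be a data matrix with $n \times p$ dimensions. Let the first $m$ variables of $X$ be continuous and the $(m+1)$th to $p$th variables be multilevel categorical variables or symmetric binary variables. Let $X_l$ be a vector representing variable $l$. Gower distance between individuals $i$ and $j$ is defined as:
$$d_{ij} = \frac{1}{p} \sum_{l=1}^{p} d_{ij}^{(l)},$$
where
$$d_{ij}^{(l)} = \frac{|x_{il} - x_{jl}|}{R_l}, \quad l = 1, \dots, m,$$
$$R_l = \max(X_l) - \min(X_l),$$
$$d_{ij}^{(l)} = \begin{cases} 0, & x_{il} = x_{jl} \\ 1, & x_{il} \neq x_{jl} \end{cases}, \quad l = m+1, \dots, p.$$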
Gower distance for an asymmetric binary variable is calculated differently. Asymmetry occurs when similarity within one level is perceived to be higher than within the other level. For example, breast cancer (yes/no) could be viewed as an asymmetric binary variable, since individuals with breast cancer are much more similar to one another than those without breast cancer (who could include men and women, adolescents and older adults). If variable $l$ is an asymmetric binary variable, then the Gower distance between individuals $i$ and $j$ with respect to this variable is defined as:
[TABLE]
In practice, there is one issue in applying Gower distance: as we will later show in simulations, Gower distance tends to give much larger weights to categorical variables than to continuous ones. This is because the distance due to a categorical variable is always 0 or 1, the minimum and the maximum of possible distance values, granting categorical variables more power in distinguishing subjects.
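The imbalance described above can be seen in a small numerical example. The following is a minimal sketch of an unweighted Gower distance, assuming no missing values and only continuous and symmetric categorical variables; the function name, the variable ranges, and the two toy patients are hypothetical.

```python
def gower_distance(a, b, is_categorical, ranges):
    """Unweighted Gower distance between two records with no missing values.

    is_categorical[l] flags categorical variables; ranges[l] is the observed
    range (max - min) of continuous variable l and is ignored otherwise.
    """
    total = 0.0
    for l, (u, v) in enumerate(zip(a, b)):
        if is_categorical[l]:
            total += 0.0 if u == v else 1.0   # categorical: 0 or 1 only
        else:
            total += abs(u - v) / ranges[l]   # continuous: scaled to [0, 1]
    return total / len(a)

# Hypothetical variables: age (range 60), lactate (range 10), sex.
is_cat = [False, False, True]
ranges = [60.0, 10.0, None]
patient_1 = [50.0, 2.0, "F"]
patient_2 = [56.0, 3.0, "M"]

d = gower_distance(patient_1, patient_2, is_cat, ranges)
```

In this example each continuous variable contributes 0.1 to the sum while the single categorical mismatch contributes 1, so the categorical variable accounts for most of the resulting distance of about 0.4.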
Another distance that could be used for mixed types of variables is the distance defined in FAMD:
[TABLE]
where
[TABLE]
[TABLE]
is number of levels of categorical variable ; is proportion of category of variable ; is category of variable .
2.2 K-means-based clustering algorithms
K-means (MacQueen et al., 1967) is the best-known and most widely applied clustering method in practice. The basic idea is to partition subjects so as to minimize the within-cluster sum of squares (WCSS). This algorithm is very efficient and has been the root of many later algorithms. It is usually used together with Euclidean distance. To cluster categorical data, the K-modes (Huang, 1998) algorithm was developed by replacing Euclidean distance with the simple matching dissimilarity measure, and replacing the mean with the mode to represent cluster centers.
To identify clusters with mixed types of variables, partitioning around medoids (PAM) (Kaufman and Rousseeuw, 2009) has been proposed. PAM is a modification of K-means with a different definition of cluster centers. Unlike K-means, which uses within-cluster means to represent its centers, PAM uses medoids, which are actual data points in the dataset. This makes it possible to define centers for categorical variables. Moreover, medoids are analogous to medians, and hence PAM is more robust to outliers. One drawback, however, is that PAM is computationally intensive, making it less suitable for processing large data sets.
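K-prototypes algorithm is another modified version of K-means that can handle mixed types of variables. Its centers are called prototypes, which use the within-cluster mean to represent continuous variables and the mode for categorical variables. With the first $m$ of the $p$ variables continuous and the rest categorical, the distance between subjects $i$ and $j$ is defined as:
$$d(x_i, x_j) = \sum_{l=1}^{m} (x_{il} - x_{jl})^2 + \gamma \sum_{l=m+1}^{p} \delta(x_{il}, x_{jl}),$$
where
$$\delta(x_{il}, x_{jl}) = \begin{cases} 0, & x_{il} = x_{jl} \\ 1, & x_{il} \neq x_{jl} \end{cases}$$
and $\gamma$ is a user-defined weight parameter for categorical variables. K-prototypes lacks flexibility in variable weights, as it assumes equal importance for variables of the same type. Moreover, the tuning parameter $\gamma$ is user-defined rather than data-driven.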
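To make the role of the single categorical weight concrete, here is a minimal sketch of a K-prototypes-style dissimilarity: squared Euclidean distance on the continuous part plus a weighted count of categorical mismatches. The function name, the argument layout (continuous variables first), and the toy values are assumptions for illustration.

```python
def kprototypes_dissimilarity(x, prototype, n_continuous, gamma):
    """Squared Euclidean distance on the first n_continuous entries plus
    gamma times the number of mismatches on the remaining (categorical) ones."""
    continuous_part = sum(
        (a - b) ** 2
        for a, b in zip(x[:n_continuous], prototype[:n_continuous])
    )
    mismatches = sum(
        1 for a, b in zip(x[n_continuous:], prototype[n_continuous:]) if a != b
    )
    return continuous_part + gamma * mismatches

# Hypothetical subject and prototype: two continuous, two categorical variables.
subject = [1.0, 2.0, "a", "b"]
prototype = [0.0, 0.0, "a", "c"]
d_small_gamma = kprototypes_dissimilarity(subject, prototype, 2, gamma=0.5)
d_large_gamma = kprototypes_dissimilarity(subject, prototype, 2, gamma=10.0)
```

Because one value of the weight multiplies every categorical mismatch, changing it shifts the influence of all categorical variables at once; no individual variable can be emphasized or downweighted.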
2.3 Hierarchical clustering
Hierarchical clustering is another category of clustering methods. It first grows a dendrogram, which is a tree-like diagram showing the hierarchical structure of subjects, and then cuts the dendrogram to obtain clusters. One advantage of hierarchical clustering is that the generated dendrogram is very informative and provides information about cluster structure in addition to cluster assignments. Its disadvantages include the lack of a global objective function, a greedy procedure, sensitivity to outliers, and inefficiency for large data sets.
2.4 Extended clustering framework
In many situations researchers are also interested in the importance of variables, not just cluster identification. Motivated by this interest, the sparse clustering framework (Witten and Tibshirani, 2010) was proposed. It incorporates feature selection through a Lasso-type penalty and adds variable weights to the objective function:
$$\underset{w;\, \Theta \in D}{\text{maximize}} \ \sum_{j=1}^{p} w_j f_j(X_j; \Theta) \quad \text{subject to} \quad \|w\|^2 \leq 1, \ \|w\|_1 \leq s, \ w_j \geq 0 \ \ \forall j,$$
where $n$ is the number of subjects; $p$ is the number of features; $w$ is the weight vector; $\Theta$ is a parameter vector restricted to lie in a set $D$; $f_j$ is some function that involves feature $j$ only; and $s$ is a norm restriction, which is a tuning parameter in the algorithm. Many algorithms, such as K-means and hierarchical clustering, can be plugged into this framework to obtain sparse versions. One of the main attractions of sparse clustering is that it conducts data-driven variable selection and clustering simultaneously. However, the selection of the tuning parameter $s$ may not be straightforward.
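With the clustering parameters held fixed, the weights that maximize this objective can be obtained by soft-thresholding the nonnegative per-feature values and rescaling to unit Euclidean norm, with the threshold chosen so that the L1 constraint holds. The following is a minimal sketch of that weight update, assuming the per-feature values are already computed and that the L1 bound is at least 1; the function names, the bisection tolerance, and the toy inputs are assumptions, not code from the sparse clustering software.

```python
import math

def soft_threshold(values, delta):
    """Shrink nonnegative values toward zero by delta, flooring at zero."""
    return [max(v - delta, 0.0) for v in values]

def normalize(values):
    """Rescale to unit Euclidean norm (returned unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)

def update_weights(a, s, tol=1e-10):
    """Maximize sum_j w_j * a_j subject to ||w||_2 <= 1, ||w||_1 <= s, w_j >= 0."""
    a = [max(v, 0.0) for v in a]
    w = normalize(a)
    if sum(w) <= s:            # L1 constraint already satisfied, no shrinkage
        return w
    lo, hi = 0.0, max(a)       # bisection on the threshold delta
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if sum(normalize(soft_threshold(a, mid))) > s:
            lo = mid
        else:
            hi = mid
    return normalize(soft_threshold(a, hi))

# Hypothetical per-feature values: two informative features, two near-noise.
a = [4.0, 3.0, 0.1, 0.05]
w = update_weights(a, s=1.2)
```

In this toy case the two near-noise features receive a weight of exactly zero, which is how the framework performs variable selection; a larger bound would keep more features.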
Many partition-based algorithms require pre-specification of $K$, the number of clusters, but how to choose it is another important question. The consensus clustering framework (Monti et al., 2003; Wilkerson and Hayes, 2010) can help determine the number of clusters and obtain cluster memberships simultaneously. In addition, it can assess the stability of the discovered clusters. Consensus clustering incorporates results from multiple runs of an inner-loop clustering algorithm (e.g., K-means) on sub-sampled subjects. For each pair of subjects, a consensus index is obtained by calculating the proportion of times the pair was assigned to the same cluster among the times both members of the pair were sampled. The consensus index can then serve as a similarity measure and be passed to a hierarchical clustering algorithm to form the final clusters. Choosing $K$ is achieved by checking the consensus matrix heatmaps and cluster-consensus values. The number of clusters that yields the cleanest heatmap and the highest cluster-consensus values is preferred.
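The consensus index described above can be sketched as follows. This is a minimal illustration that assumes the inner-loop clustering has already been run; each run is represented as a dictionary from sampled subject index to cluster label, and the function name and toy runs are hypothetical.

```python
from itertools import combinations

def consensus_matrix(n_subjects, runs):
    """Proportion of runs in which each pair shared a cluster, among the
    runs in which both members of the pair were sampled."""
    together = [[0] * n_subjects for _ in range(n_subjects)]
    sampled = [[0] * n_subjects for _ in range(n_subjects)]
    for labels in runs:
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    consensus = [[0.0] * n_subjects for _ in range(n_subjects)]
    for i in range(n_subjects):
        for j in range(n_subjects):
            if i == j:
                consensus[i][j] = 1.0
            elif sampled[i][j] > 0:
                consensus[i][j] = together[i][j] / sampled[i][j]
            # pairs never sampled together keep the default value 0.0
    return consensus

# Four hypothetical sub-sampled runs on four subjects; labels are arbitrary
# within each run, so only co-membership matters.
runs = [
    {0: "a", 1: "a", 2: "b"},
    {0: "x", 1: "x", 3: "y"},
    {1: "u", 2: "v", 3: "v"},
    {0: "a", 2: "b", 3: "b"},
]
M = consensus_matrix(4, runs)
```

Here subjects 0 and 1 always cluster together, as do subjects 2 and 3, so the matrix shows two clean blocks, the pattern that indicates a stable choice of the number of clusters.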
2.5 Density-based clustering
Another important category of clustering methods is density-based clustering. All of the above-mentioned algorithms are distance-based methods, which are more appropriate for detecting clusters that are convex and of similar sizes and densities. If the underlying clusters have arbitrary shapes, density-based clustering algorithms may work better. Density-based spatial clustering of applications with noise (DBSCAN) (Ester et al., 1996) and ordering points to identify the clustering structure (OPTICS) (Ankerst et al., 1999) are two widely used density-based algorithms. DBSCAN does not need $K$ as input, and it is robust to noise. However, it is not well suited for high-dimensional data or for clusters with varying densities. OPTICS is an improved method that can detect clusters with varying densities while not being overly sensitive to its user-specified tuning parameters.
2.6 Model-based clustering
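FMM is a model-based clustering method that assumes the data consist of $K$ latent clusters. Its density function is defined as:
$$f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i \mid \theta_k),$$
where $\pi_k$ is the cluster mixture probability, $\sum_{k=1}^{K} \pi_k = 1$; and $f_k(x_i \mid \theta_k)$ is the conditional distribution given cluster $k$. For a sample of size $n$, the log-likelihood can be written as:
$$\ell = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k f_k(x_i \mid \theta_k) \right).$$
FMM assumes conditional independence given cluster $k$, that is, $f_k(x_i \mid \theta_k) = \prod_{l=1}^{p} f_{kl}(x_{il} \mid \theta_{kl})$, and the EM algorithm is usually used to obtain the MLE. The posterior probability of each subject belonging to each cluster can be calculated as:
$$P(k \mid x_i) = \frac{\pi_k f_k(x_i \mid \theta_k)}{\sum_{k'=1}^{K} \pi_{k'} f_{k'}(x_i \mid \theta_{k'})}.$$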
Subjects are then assigned to the cluster with which the posterior probability is the largest. These probabilities help discriminate core subjects (those with high probability of belonging to assigned cluster) and border subjects (those with low probability of belonging to assigned cluster) within each cluster. Given the parametric form of FMM, formal inference is possible. In addition, selecting the number of clusters becomes a model selection problem. The main drawback however is its unverifiable distributional assumptions; all the inferences are conducted conditional on unknown cluster assignments.
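The distinction between core and border subjects can be illustrated with a minimal sketch of the posterior calculation for a one-dimensional mixture of two normal components. The parameter values are hypothetical and are taken as known here; in practice they would be estimated by the EM algorithm.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_probabilities(x, weights, means, sds):
    """Posterior probability of each mixture component given observation x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-component mixture with equal weights and unit variances.
weights, means, sds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

core = posterior_probabilities(0.0, weights, means, sds)    # near first mean
border = posterior_probabilities(2.0, weights, means, sds)  # midway between
```

A subject at the first component's mean is assigned to that component with posterior probability above 0.99, whereas a subject midway between the two means has posterior probability 0.5 for each component and is therefore a border subject.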
There are some other approaches to handling mixed types of variables. These include categorizing all continuous variables (Haripriya et al., 2015), or converting categorical variables into continuous or dummy variables and then treating the dummy variables as continuous (Hennig and Liao, 2013). However, both ideas lead to information loss. Another common idea is to cluster the continuous part and the categorical part of the data separately. The final clusters are obtained by ensembling these two sets of clustering results (Reddy and Kavitha, 2012). This method weighs continuous and categorical variables equally, which is often impractical, and ignores possible mutual influences between the two variable types.
3 Proposed hybrid density- and partition-based clustering (HyDaP) algorithm for mixed data
To address the limitations of the existing clustering methods in handling data containing mixed types of variables, we propose a hybrid density- and partition-based clustering (HyDaP) algorithm which consists of a pre-processing step (step 1) and a clustering step (step 2). The pre-processing step identifies the data structure formed by continuous variables and recognizes the important variables for clustering. In the clustering step, our proposed dissimilarity measure is used to obtain a dissimilarity matrix, which can be fed into PAM to obtain the final results. We describe the HyDaP algorithm in detail below.
3.1 Pre-processing step (Step 1)
To help with variable selection and better understand the data set, we first define 3 data structures for the space spanned by the continuous variables as: natural cluster structure (data structure 1); partitioned cluster structure (data structure 2); and homogeneous structure (data structure 3). Once the data structure is known, we apply tailored variable selection procedures. At the end of the pre-processing step, a set of selected variables will proceed to the clustering step (step 2) (Figure 1).
3.1.1 Data structure identification
Data spanned in the covariate space of continuous variables can be divided into two scenarios: with and without natural clusters. A hypothetical example of these two scenarios is depicted in Figure 2. We can observe that both Data 1 and Data 2 contain two variables, but natural clusters exist only in Data 1. Although this conclusion is straightforward for Data 1 and Data 2, when data are spanned in a high-dimensional space it is impossible to examine the existence of natural clusters visually. Therefore, we use a density-based clustering algorithm (e.g., OPTICS) and the resulting reachability plot to help understand the spatial structure of the data. A reachability plot is a bar plot showing ordered reachability distances among subjects (Ankerst et al., 1999). It provides an overall 2-dimensional view of the spatial structure of a dataset regardless of its original dimensions. The horizontal axis of the plot is the processing order and the vertical axis is the reachability distance. Each trough on the reachability plot can be viewed as a single cluster. Edges between two side-by-side troughs represent the distance between the two closest border points from the corresponding two clusters. Higher edges imply that the corresponding two clusters are farther apart, while lower or unclear edges imply that the clusters are not that distinct from each other.
If we observe multiple troughs in a reachability plot, as illustrated in the reachability plot of Data 1 in Figure 2, this indicates the existence of distinct clusters, i.e., the corresponding dataset has natural clusters. We call this type of structure natural cluster structure (data structure 1) and aim to identify these distinct clusters. If we observe only one trough or no clear trough in the reachability plot (e.g., the reachability plot of Data 2 in Figure 2), this indicates that distinct clusters do not exist. We then investigate whether data points in the continuous covariate space are sufficiently heterogeneous to be further partitioned. We apply the consensus clustering framework to all continuous variables to assess possible heterogeneity by checking the selected optimal number of clusters. If we obtain more than one cluster in consensus clustering, this indicates that heterogeneity exists and that we can obtain stable clusters through partitioning. We call this type of structure partitioned cluster structure (data structure 2). If the optimal number of clusters from the consensus clustering results is one, this indicates that the continuous part of the data is highly homogeneous and cannot be further partitioned. We call this type of structure homogeneous structure (data structure 3).
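The decision rule in this subsection can be summarized in a few lines. The sketch below assumes that the number of troughs has already been read off the reachability plot and that the optimal number of clusters has already been chosen by consensus clustering; both inputs, and the function name, are hypothetical placeholders for those manual or upstream steps.

```python
def identify_data_structure(n_troughs, consensus_k):
    """Map the two diagnostics to one of the three data structures.

    n_troughs:   number of clear troughs seen in the reachability plot
    consensus_k: optimal number of clusters from consensus clustering on
                 the continuous variables (only used without natural clusters)
    """
    if n_troughs >= 2:
        return "natural cluster structure"       # data structure 1
    if consensus_k >= 2:
        return "partitioned cluster structure"   # data structure 2
    return "homogeneous structure"               # data structure 3

structure_a = identify_data_structure(n_troughs=3, consensus_k=3)
structure_b = identify_data_structure(n_troughs=1, consensus_k=3)
structure_c = identify_data_structure(n_troughs=0, consensus_k=1)
```

Counting troughs is a visual judgment in the text, so this function only encodes what follows once that judgment has been made.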
3.1.2 Variable selection
After identifying the data structure, we conduct variable selection tailored to that structure.
Under the natural cluster structure, distinct clusters can be determined by continuous variables. Therefore, we would like to select the continuous variables with high contributions. As shown in Figure 1, we apply sparse K-means to all continuous variables and keep those with high weights (suggestions for the weight threshold can be found in Section 3.3.2). The number of clusters under this structure can be determined by the number of troughs in the reachability plot. Next, we calculate Cramer’s V between each categorical variable and the cluster membership obtained from sparse K-means. We select only categorical variables with high Cramer’s V values. Cramer’s V measures the association between nominal variables. It ranges from 0 to 1; a larger value indicates a stronger association, and vice versa. Unlike the p-value, Cramer’s V is not affected by the sample size. Researchers have suggested 0.3 as the cutoff value; namely, a Cramer’s V larger than 0.3 indicates a moderate to strong association.
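For reference, Cramer's V can be computed from the chi-squared statistic of the contingency table between a categorical variable and the cluster labels. The following is a minimal sketch without bias correction; the function name and the toy label vectors are assumptions for this example.

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Cramer's V between two nominal variables given as equal-length lists."""
    n = len(x)
    row_counts = Counter(x)
    col_counts = Counter(y)
    joint_counts = Counter(zip(x, y))
    chi2 = 0.0
    for r, n_r in row_counts.items():
        for c, n_c in col_counts.items():
            expected = n_r * n_c / n
            observed = joint_counts.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_counts), len(col_counts)) - 1
    if k == 0:          # a constant variable has no association to measure
        return 0.0
    return math.sqrt(chi2 / (n * k))

# Hypothetical cluster labels and two candidate categorical variables.
clusters = [1, 1, 2, 2]
aligned = ["a", "a", "b", "b"]      # matches the clusters exactly
unrelated = ["a", "b", "a", "b"]    # evenly spread across clusters

v_aligned = cramers_v(aligned, clusters)
v_unrelated = cramers_v(unrelated, clusters)
```

With the 0.3 cutoff mentioned above, the first variable (V equal to 1) would be kept and the second (V equal to 0) would be dropped.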
Under the partitioned cluster structure, distinct clusters do not exist; however, the covariate space of all continuous variables is sufficiently heterogeneous to be further partitioned. This structure indicates that all of the continuous variables together contribute to heterogeneity, but none of them has a driving influence. Therefore, we keep all continuous variables and run consensus K-means to select the optimal number of clusters. Next, we calculate Cramer’s V between each categorical variable and the cluster membership obtained from consensus K-means. We select only categorical variables with high Cramer’s V values.
Under the homogeneous structure, no distinct cluster exists and we are not able to further partition the continuous covariate space into homogeneous subgroups. Therefore, we drop all continuous variables, as they do not distinguish between clusters. Next, we calculate pairwise Cramer’s V values among the categorical variables and select only pairs with high Cramer’s V values.
3.2 Clustering step (Step 2)
After variables with high contributions are selected, we proceed to the final clustering step. This step is the same across all data structures. We calculate the dissimilarities between subjects using our proposed dissimilarity measure, a modified version of the Gower distance. Assume that the first $m$ variables are continuous and the rest are categorical. Our proposed dissimilarity between subjects $i$ and $j$ is defined as:
[TABLE]
where
[TABLE]
[TABLE]
[TABLE]
Our modification is based on the idea of standardization, which prevents variables with high variability from being overly influential on clustering results. It is motivated by the definition of Gower distance for categorical variables, which receive only the extreme dissimilarity values 0 or 1 and could therefore exhibit high variability. This allows them to exert greater influence on the clustering results even if they are less informative than the continuous ones.
Below we show how our modification of the dissimilarities is analogous to the standardization of continuous variables. The standardized squared Euclidean distance between subjects $i$ and $j$ with respect to a continuous variable $l$ is:
$$\frac{(x_{il} - x_{jl})^2}{s_l^2}, \qquad s_l^2 = \frac{1}{n-1} \sum_{a=1}^{n} (x_{al} - \bar{x}_l)^2,$$
which can be re-written as:
$$\frac{(x_{il} - x_{jl})^2}{\frac{1}{n(n-1)} \sum_{a<b} (x_{al} - x_{bl})^2},$$
where the numerator is the original squared Euclidean distance and the denominator is proportional to the sum of all pairwise distances. We adopt this idea to standardize the Gower distance; namely, we divide the original Gower distance of variable $l$ by the sum of all pairwise Gower distances of variable $l$, as shown above.
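The verbal description of this standardization can be sketched in code. The example below is one reading of that description only: each variable's pairwise Gower distances are divided by their sum over all pairs, and the standardized values are then added across variables. How the per-variable terms are combined, the handling of constant variables, and all names and toy data are assumptions and are not taken from the formula in this section.

```python
from itertools import combinations

def per_variable_gower(column, categorical):
    """Pairwise Gower distances for a single variable (no missing values)."""
    pairs = list(combinations(range(len(column)), 2))
    if categorical:
        return {(i, j): 0.0 if column[i] == column[j] else 1.0 for i, j in pairs}
    value_range = max(column) - min(column)
    if value_range == 0:
        return {pair: 0.0 for pair in pairs}
    return {(i, j): abs(column[i] - column[j]) / value_range for i, j in pairs}

def standardized_dissimilarity(data, is_categorical):
    """Sum over variables of (pairwise distance / sum of all pairwise distances)."""
    n = len(data)
    total = {pair: 0.0 for pair in combinations(range(n), 2)}
    for l, categorical in enumerate(is_categorical):
        column = [row[l] for row in data]
        d = per_variable_gower(column, categorical)
        denominator = sum(d.values())
        if denominator == 0:      # constant variable: contributes nothing
            continue
        for pair, value in d.items():
            total[pair] += value / denominator
    return total

# Hypothetical data: one continuous and one categorical variable.
data = [[1.0, "a"], [2.0, "a"], [5.0, "b"]]
D = standardized_dissimilarity(data, is_categorical=[False, True])
```

Under this reading, every non-constant variable contributes the same total amount (one unit) to the dissimilarity matrix, so a categorical variable can no longer dominate simply because its distances take only the values 0 and 1. The resulting pairwise values could then be arranged as a matrix and passed to PAM.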
If after the pre-processing step all selected variables are continuous, we can just apply usual clustering methods to obtain the final clustering results.
3.3 Parameters selection
In this section we provide general suggestions on the selection of (1) the optimal number of clusters; (2) continuous variables under natural cluster structure.
3.3.1 Number of clusters
Under the natural cluster structure, the number of clusters can be decided by the number of troughs in the reachability plot. Under the partitioned cluster structure, the number of clusters can be selected from the results of the consensus clustering. Under the homogeneous structure, we only select categorical variables in determining cluster membership. Hence we suggest constructing a dissimilarity matrix using our proposed dissimilarity measure and then plot the number of clusters against the corresponding within-cluster sum of dissimilarities. In this plot, we look for an elbow for the optimal number of clusters.
3.3.2 Selecting continuous variables under the natural cluster structure
Selection of the continuous variables with high weights under the natural cluster structure could be subjective because of the choice of the weight threshold. We suggest applying sparse K-means to the continuous part of each bootstrapped data set and then calculating the between-cluster sum of squares (BCSS) of each variable. We then order the variables by their median BCSS from the smallest to the largest and plot the median BCSS together with a quantile interval. We then drop variables whose BCSS values are small or far away from the others. Our suggestion here is a heuristic one. Users can always incorporate other information and make their own judgements.
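The per-variable quantity used in this heuristic can be sketched as follows. The example computes the BCSS of each continuous variable for one data set and one set of cluster labels; in the procedure described above the labels would come from sparse K-means and the calculation would be repeated over bootstrapped data sets. The function name and toy data are hypothetical.

```python
def bcss_per_variable(data, labels):
    """Between-cluster sum of squares of each column of `data` given labels."""
    n_variables = len(data[0])
    result = []
    for j in range(n_variables):
        column = [row[j] for row in data]
        grand_mean = sum(column) / len(column)
        groups = {}
        for value, label in zip(column, labels):
            groups.setdefault(label, []).append(value)
        # Each cluster contributes (cluster size) * (cluster mean - grand mean)^2.
        bcss = sum(
            len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
        )
        result.append(bcss)
    return result

# Hypothetical data: the first variable separates the clusters, the second does not.
data = [[0.0, 1.0], [0.0, 2.0], [10.0, 1.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
bcss = bcss_per_variable(data, labels)
```

The first variable has a large BCSS and the second has a BCSS of zero, so under the heuristic the second variable would be a candidate for removal.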
4 Simulation studies
In this section we use simulations to evaluate the performance of the HyDaP algorithm relative to existing approaches. We assume that there are 3 underlying true clusters with cluster sizes of 40, 40, and 120. In terms of variable importance, we considered three scenarios: (1) both variable types contribute to clustering, (2) only continuous variables contribute to clustering, and (3) only categorical variables contribute to clustering. In terms of data structures, all 3 data structures were covered in the simulations. Details of the distributions and parameters used in these simulation settings are shown in Table 1.
For each setting, 500 datasets were generated. Cluster analysis was performed on each dataset using the proposed HyDaP algorithm. We compared its performance with PAM with Gower distance, K-prototypes, FMM, and PAM with FAMD distance. Since we know the true cluster labels, the adjusted Rand index (ARI) was calculated and used to evaluate the performance of the different methods. ARI measures the agreement between two nominal variables. Its largest value is 1, indicating perfect agreement, and values close to 0 indicate agreement no better than chance. For the purpose of evaluating clustering performance in simulations, higher ARI values indicate better agreement with the true cluster labels and hence better performance. The reachability plot for each setting is illustrated in Figure 3. Table 2 summarizes the results of the pre-processing step of the HyDaP algorithm. The clustering performance with respect to ARI across all simulation settings is shown in Table 3. To examine the impact of conditional correlation on clustering performance, each simulation setting was repeated with a pairwise correlation of 0.4 conditional on the true cluster labels. Results are shown in Table 4. The median and a percentile interval were reported for all statistics.
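For readers who want to reproduce the evaluation metric, the adjusted Rand index can be computed from the contingency table of the two labelings. The following is a minimal sketch in plain Python (it needs Python 3.8 or later for `math.comb`); the function name and toy labels are assumptions for this example.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same subjects."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:   # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = [0, 0, 1, 1, 2, 2]
relabeled = [5, 5, 7, 7, 9, 9]   # same partition, different label names
crossed = [0, 1, 0, 1, 0, 1]     # splits every true cluster

ari_relabeled = adjusted_rand_index(truth, relabeled)
ari_crossed = adjusted_rand_index(truth, crossed)
```

The index depends only on which subjects are grouped together, so renaming the clusters leaves it at 1, while a partition that splits every true cluster scores far below 1.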
4.1 Setting 1: Both types of variables contribute to clustering
4.1.1 Natural cluster structure
In simulation 1(a), we simulated a total of 5 variables: 4 continuous and 1 categorical. All except one continuous variable truly contribute to clustering. The sole categorical variable also contributes to clustering.
In Step 1 of the HyDaP algorithm, the reachability plot (Figure 3(a)) indicated 3 clusters. Therefore, this setting has a natural cluster structure, as 3 distinct clusters exist. Table 2 shows the very low contribution of one continuous variable in the sparse K-means and the strong association between the categorical variable and the clusters identified by the sparse K-means. We dropped that continuous variable and kept all the others.
In Step 2, we applied PAM along with the proposed dissimilarity measure to the variables selected in Step 1: the three remaining continuous variables and the categorical variable.
As shown in Table 3, HyDaP algorithm performed very well [ARI: 0.97 (0.92, 1.00)]. Although K-prototypes (ARI: 1.00 [0.96, 1.00]) and FMM (ARI: 1.00 [0.98, 1.00]) both performed slightly better, our HyDaP algorithm was able to identify important variables. PAM with Gower distance (ARI: 0.70 [0.58, 0.80]) and PAM with FAMD distance (ARI: 0.78 [0.66, 0.89]) performed poorly. This is not surprising of the results using the Gower distance since it tends to downplay contributions of continuous variables, although in this setting continuous variables to all have large contributions to clustering.
4.1.2 Partitioned cluster structure
In simulation 1(b), we simulated a total of 14 variables: 11 continuous and 3 categorical. Six out of eleven continuous variables truly contribute to clustering; two out of three categorical variables contribute to clustering.
In Step 1 of the HyDaP algorithm, Figure 3(b) indicated that no natural clusters exist. After conducting consensus K-means, we chose 3 as the optimal number of clusters, as its corresponding cluster-consensus values were the largest. Thus, a partitioned cluster structure was identified. All continuous variables were retained for the next step. One categorical variable was dropped because of its small Cramer’s V with the cluster assignments obtained in consensus K-means.
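The screening of categorical variables against a preliminary cluster assignment can be sketched with the standard (uncorrected) Cramer's V, computed from the chi-squared statistic of the contingency table. This is a minimal standard-library illustration; the function name `cramers_v` and the toy data are assumptions, and the paper's cut-off for a "small" value is not reproduced here.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Uncorrected Cramer's V between two categorical sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("need two non-empty sequences of equal length")

    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)

    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association

    # Pearson chi-squared statistic over every cell, including empty ones.
    chi2 = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    return sqrt(chi2 / (n * k))


# Hypothetical preliminary cluster assignment and two categorical variables.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
informative = ["a", "a", "a", "a", "b", "b", "b", "b"]  # follows the clusters
noise = ["a", "b", "a", "b", "a", "b", "a", "b"]        # unrelated to them

v_informative = cramers_v(informative, clusters)
v_noise = cramers_v(noise, clusters)
```

In this toy example the informative variable reaches the maximum value of 1 and the noise variable gives 0, so a screening rule of this kind would keep the former and drop the latter.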
In Step 2, PAM with the proposed dissimilarity measure was applied to all eleven continuous variables and the two retained categorical variables to obtain the final results.
Performance of the HyDaP algorithm was satisfactory (ARI: 0.95 [0.87, 1.00]). Although it was unable to eliminate continuous variables that are purely noise, the HyDaP algorithm revealed that no single continuous variable has a driving effect; rather, all of them together lead to heterogeneity in the feature space they span. In this setting, K-prototypes (ARI: 0.93 [0.79, 1.00]) and PAM with FAMD distance (ARI: 0.93 [0.84, 0.98]) also worked well, while the performance of FMM varied widely from sample to sample (ARI: 0.98 [0.44, 1.00]). PAM with Gower distance did not perform as well as the others (ARI: 0.87 [0.76, 0.96]). This is because a noise categorical variable was included and Gower distance tends to amplify its contribution.
4.2 Setting 2: Only continuous variables contribute to clustering
4.2.1 Natural cluster structure
In simulation 2(a), we simulated a total of 5 variables: 4 continuous and 1 categorical. This setting is the same as simulation 1(a) except that the sole categorical variable does not contribute to clustering.
In Step 1 of the HyDaP algorithm, one continuous variable was dropped due to its low contribution in the sparse K-means. Table 2 shows a weak association between the categorical variable and the clusters identified by the sparse K-means, so the categorical variable was dropped as well.
In Step 2, we applied the sparse K-means to the three retained variables, as they are all continuous.
In this setting, the HyDaP algorithm (ARI: 0.98 [0.94, 1.00]) and K-prototypes (ARI: 0.98 [0.92, 1.00]) both worked well. There were a few simulation runs in which the performance of FMM was not satisfactory (ARI: 1.00 [0.56, 1.00]). PAM with Gower distance (ARI: 0.01 [-0.01, 0.04]) and PAM with FAMD distance (ARI: 0.09 [-0.01, 0.44]) performed extremely poorly. As mentioned in simulation 1(b), Gower distance tends to amplify the contributions of categorical variables. Meanwhile, FAMD was not originally designed for clustering.
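The imbalance attributed to the Gower distance can be seen in a small worked example. The sketch below assumes the common equal-weight form of Gower's coefficient (range-normalized absolute difference for continuous variables, simple mismatch for categorical ones); the function name `gower_distance` and the two toy records are hypothetical.

```python
def gower_distance(a, b, ranges, is_categorical):
    """Equal-weight Gower distance between two mixed-type records.

    Continuous variables contribute |a - b| / range; categorical variables
    contribute 0 for a match and 1 for a mismatch.
    """
    total = 0.0
    for a_j, b_j, range_j, categorical in zip(a, b, ranges, is_categorical):
        if categorical:
            total += 0.0 if a_j == b_j else 1.0
        elif range_j > 0:
            total += abs(a_j - b_j) / range_j
    return total / len(a)


# Two records that differ by 10% of the range on each of three continuous
# variables and disagree on one binary categorical variable.
rec_a = (1.0, 2.0, 3.0, "M")
rec_b = (2.0, 3.0, 4.0, "F")
ranges = (10.0, 10.0, 10.0, None)   # observed range of each continuous variable
is_cat = (False, False, False, True)

d = gower_distance(rec_a, rec_b, ranges, is_cat)

# Share of the total dissimilarity that comes from the categorical variable.
categorical_share = 1.0 / (d * len(rec_a))
```

Here the single categorical mismatch accounts for roughly three quarters of the total dissimilarity, which illustrates how a noise categorical variable can dominate the partition when the continuous differences are small relative to their ranges.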
4.2.2 Natural cluster structure
In simulation 2(b), we simulated a total of 8 variables: 5 continuous and 3 categorical. Four out of five continuous variables truly contribute to clustering and follow highly skewed distributions. None of the categorical variables contributes to clustering.
In Step 1 of the HyDaP algorithm, Figure 3(d) shows 3 distinct clusters, and hence this setting was identified as a natural cluster structure. We dropped one continuous variable because of its small contribution to clustering, as shown in Table 2. All categorical variables were dropped as well, given their weak associations with the clusters obtained in the sparse K-means.
In Step 2, we applied the sparse K-means to the retained variables, since they are all continuous.
In this setting, the HyDaP algorithm performed the best (ARI: 0.98 [0.92, 1.00]). PAM with Gower distance (ARI: 0.23 [0.00, 0.34]), K-prototypes (ARI: 0.58 [0.38, 0.99]), FMM (ARI: 0.41 [0.33, 0.58]), and PAM with FAMD distance (ARI: 0.34 [0.08, 0.39]) all performed poorly. This was expected for FMM because most of the continuous variables were not normally distributed conditional on the true cluster labels.
4.3 Setting 3: Only categorical variables contribute to clustering
4.3.1 Homogeneous structure
In simulation 3, we simulated a total of 7 variables: 4 continuous and 3 categorical. None of the continuous variables truly contributes to clustering. Two out of three categorical variables contribute to clustering.
In Step 1 of the HyDaP algorithm, Figure 3(e) indicates that no natural clusters exist. After conducting consensus K-means, the optimal number of clusters chosen was 1 because the cluster-consensus values were low for all numbers of clusters. Hence this was identified as a homogeneous structure. All continuous variables were dropped, but two categorical variables were kept due to their strong association with each other, as shown in Table 2.
In Step 2, PAM with the proposed dissimilarity measure was applied to the two retained categorical variables.
In this setting, the HyDaP algorithm performed the best (ARI: 0.75 [0.63, 0.85]) and K-prototypes did the worst (ARI: 0.17 [-0.01, 0.26]). The performances of PAM with Gower distance (ARI: 0.71 [0.31, 0.84]), FMM (ARI: 0.72 [0.56, 0.85]), and PAM with FAMD distance (ARI: 0.73 [0.22, 0.84]) were similar.
4.4 Variables are conditionally correlated
To assess the impact of within-cluster correlation, the simulations for each of the 5 settings above were repeated with a pairwise correlation of 0.4 among all continuous variables, conditional on the true cluster labels. As summarized in Table 4, within-cluster correlation had little to no impact on the performance of the HyDaP algorithm, PAM with Gower distance, K-prototypes, and PAM with FAMD distance. In some situations, it led to worse performance of FMM. This is expected since FMM assumes conditional independence, namely that all variables are independent of each other conditional on the cluster labels. However, we did observe that in simulation 3, where none of the continuous variables contributes to clustering, the optimal number of clusters selected by the consensus K-means was 2 instead of 3 (figures not shown here). This is understandable: since all pairs of continuous variables are correlated given the true cluster labels, they share a lot of common information. To some extent we can use only one of them without losing much information, as all the others are redundant. Any single continuous variable can potentially be divided into 2 subgroups that differ somewhat, but this does not necessarily mean that these 2 subgroups can be viewed as 2 clusters. Therefore, if we observe that 2 is the optimal number of clusters in the consensus clustering results and most pairs of continuous variables have high conditional correlations, we should be cautious. One suggestion is to look for continuous variables that have similar clinical meanings, e.g., aspartate aminotransferase (AST) and alanine aminotransferase (ALT), since these variables are very likely to be highly correlated within clusters. For such variables we can keep only one of them in clustering to avoid this situation.
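One simple way to generate continuous variables with a common pairwise correlation within a cluster is a shared-factor construction. The sketch below is an assumption about how such data could be simulated, not the generating mechanism used in the paper: it assumes unit-variance normal variables, and the cluster means, sample size, and function names are hypothetical.

```python
import random
from math import sqrt


def simulate_cluster(n, means, rho, rng):
    """Draw n rows of unit-variance normal variables whose pairwise
    correlation within the cluster equals rho (0 <= rho < 1)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # factor common to all variables in a row
        rows.append([
            m + sqrt(rho) * shared + sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for m in means
        ])
    return rows


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = simulate_cluster(20000, means=[0.0, 3.0, -2.0], rho=0.4, rng=rng)

first = [row[0] for row in rows]
second = [row[1] for row in rows]
within_cluster_corr = pearson(first, second)
```

Each variable is the sum of a shared component with variance rho and an independent component with variance 1 - rho, so every pair has covariance rho and the empirical within-cluster correlation should be close to 0.4.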
4.5 Simulation summary
From the simulation studies, we found that our proposed HyDaP algorithm was consistently the top or one of the top performers across all simulation settings. Moreover, we found that (1) when categorical variables do not contribute much to clustering, PAM with Gower distance performed poorly; (2) when continuous variables follow arbitrary distributions, FMM may not perform well due to assumption violation; (3) when none of the continuous variables contributes to clustering, K-prototypes may fail; and (4) the performance of PAM with FAMD distance was not stable across different scenarios, as its distance measure is not specifically designed for clustering.
5 Real data application
We used the EHR data collected from the Sepsis ENdotyping in Emergency CAre (SENECA) project to demonstrate the use of our proposed HyDaP algorithm for identifying phenotypes in patients with sepsis. The SENECA data contain 20,189 sepsis encounters collected from 12 University of Pittsburgh Medical Center (UPMC) healthcare systems from 2010 to 2012. The goal is to explore whether clinical sepsis phenotypes are identifiable for a patient who presents at the emergency department and whether the phenotypes are associated with various clinical endpoints. The objectives of the analysis are to select the most important variables among 30 demographic and clinical variables, and to identify homogeneous clusters (phenotypes). Twenty-eight variables were continuous and 2 were categorical. Although we do not have much information about the optimal number of clusters for the data set, our clinician colleagues suggested that larger numbers of clusters are preferred.
Data structure identification: The reachability plot in Figure 4 indicates that there are no natural clusters in the SENECA data. Unlike in genetic data, we rarely observe natural clusters in data collected from clinical settings. We then performed the consensus K-means on all continuous variables. The results, depicted in Figure 5, suggest that the optimal number of clusters is 4, which indicates that the SENECA data have a partitioned cluster structure.
Variable selection: Under partitioned cluster structure we kept all continuous variables. For categorical variables, Cramer’s V is 0.05 between gender and cluster membership from consensus clustering and it is also 0.05 between race and cluster membership. Therefore, we dropped gender and race before proceeding to the final clustering step.
Clustering step: All categorical variables were excluded after the pre-processing step, so we took the results from the consensus K-means as our final clustering results. In terms of variable importance, all continuous variables together contributed to the obtained partitions, but none of them showed a dominant impact. Neither gender nor race was an important clustering variable. We obtained 4 clusters with relatively balanced sample sizes. Within each cluster, the distributions of some important clinical endpoints are shown in the top left plot of Figure 6. We can observe that Cluster 1 has the lowest proportions for all clinical endpoints, while Cluster 2 has the second lowest. Cluster 4 has the highest proportions. With our clinician colleagues, we examined the patient characteristics of the resulting clusters. We observed that sepsis patients in Cluster 1 had fewer other health issues; patients in Cluster 2 were older and had multimorbidity and renal dysfunction; patients in Cluster 3 had more inflammation and pulmonary dysfunction; and patients in Cluster 4 had more acidosis and liver and cardiovascular dysfunction.
For comparison, we applied PAM with Gower distance, K-prototypes, and FMM to obtain cluster memberships assuming 4 clusters. The results are summarized in Figure 6. For PAM with Gower distance, we took a random sample of the whole SENECA data because the computation time of this algorithm was very long. After further exploration, we found that gender dominated the clustering result, as reflected in the proportion of males in each cluster. Note that gender was not relevant based on our proposed HyDaP algorithm. For K-prototypes, we found that the 4 clusters obtained were not very distinct from each other in terms of the distribution of clinical endpoints. The 4 clusters obtained from the FMM appeared to be distinct from each other and similar to what we observed with the HyDaP algorithm. However, in the FMM results Cluster 1 has a larger proportion of patients admitted to the ICU and receiving mechanical ventilation and vasopressors than Cluster 2, yet it has a lower mortality rate.
Next, we re-applied the existing methods by first selecting the method-specific optimal number of clusters. The number of clusters versus the total WCSS or BIC values is shown in the left column of Figure 7. We found that the optimal number of clusters was 2 for PAM with Gower distance and for K-prototypes, and 3 for FMM. We once again observed that the clustering results were dominated by gender when using PAM with Gower distance. The proportion of men was 1.2% in Cluster 1 and 98.8% in Cluster 2. The two clusters were quite similar in terms of clinical endpoints. Similarly, the 2 clusters identified by K-prototypes were not very distinct in terms of clinical endpoints. The FMM identified 3 clusters with quite different distributions of clinical endpoints, but the HyDaP algorithm was able to identify one more cluster with distinct clinical features.
6 Discussion
We proposed a hybrid density- and partition-based clustering (HyDaP) algorithm to conduct variable selection and identify clusters in data consisting of mixed types of variables. Our algorithm involves a pre-processing step to identify the data structure formed by continuous variables and to select important variables, and a clustering step to determine the cluster membership. In the clustering step, we proposed a dissimilarity measure that balances the contributions of continuous and categorical variables, which the existing clustering methods do not offer. Through simulation studies, we showed that the proposed HyDaP algorithm is robust to different data structures and outperforms or is on par with commonly used methods. We also defined 3 different data structures to help understand data and better interpret clustering results. Our method successfully identified four clinically meaningful sepsis phenotypes in data extracted from the EHR of multiple health care systems. The resulting phenotypes are highly associated with several clinical endpoints.
Our HyDaP algorithm inherits the limitations of sparse K-means. For data under the natural cluster structure, if the continuous variables contain many outliers or excessive zeros (i.e., are zero-inflated), the sparse K-means procedure cannot correctly identify the variables with high contributions. Another situation that could affect the later steps in our method and lead to unsatisfactory results is when the data contain continuous variables that take the same value for the majority of subjects while the remaining few have different values. We also suggest checking for variables that have similar clinical meanings or are highly clinically related before clustering, and keeping only one of them to avoid within-cluster correlations.
Clustering has emerged as one of the essential and popular techniques for discovering patterns in data or disease phenotypes. Although clustering methods keep evolving to cope with the increasing complexity of data, certain features of data sets can limit the use of the existing approaches. Unlike genetics or genomics data, data collected from clinical settings often include various variable types. Our proposed algorithm overcomes the drawbacks of the commonly used clustering algorithms; therefore, the results from our method may be more helpful to clinicians in making good medical decisions.
Appendix A
Variables used in SENECA data analysis
Age
Gender: categorical variable; 2 levels (male/female)
Race: categorical variable; 3 levels (white/black/hispanic)
Maximum temperature within 6 hours of ER presentation
Maximum heart rate within 6 hours of ER presentation
Minimum systolic blood pressure within 6 hours of ER presentation
Maximum respiration rate within 6 hours of ER presentation
Maximum albumin within 6 hours of ER presentation
Maximum Cl within 6 hours of ER presentation
Maximum erythrocyte sedimentation rate (ESR) within 6 hours of ER presentation
Maximum hemoglobin within 6 hours of ER presentation
Maximum bicarbonate within 6 hours of ER presentation
Maximum Sodium within 6 hours of ER presentation
Minimum Glasgow Coma Scale (GCS) within 6 hours of ER presentation
Elixhauser Score
Maximum white blood cell within 6 hours of ER presentation
Maximum bands within 6 hours of ER presentation
Maximum creatinine within 6 hours of ER presentation
Maximum bilirubin within 6 hours of ER presentation
Maximum troponin within 6 hours of ER presentation
Maximum lactate within 6 hours of ER presentation
Maximum alanine aminotransferase (ALT) within 6 hours of ER presentation
Maximum aspartate aminotransferase (AST) within 6 hours of ER presentation
Maximum C-reactive protein within 6 hours of ER presentation
Maximum international normalized ratio (INR) within 6 hours of ER presentation
Maximum glucose within 6 hours of ER presentation
Maximum Platelets within 6 hours of ER presentation
Maximum blood urea nitrogen (BUN) within 6 hours of ER presentation
Oxygen saturation (SaO2)
Minimum partial pressure of oxygen (PaO2) within 6 hours of ER presentation
