A Novel Multiple Classifier Generation and Combination Framework Based on Fuzzy Clustering and Individualized Ensemble Construction
Zhen Gao, Maryam Zand, Jianhua Ruan

TL;DR
This paper introduces ICE, a new individualized ensemble method that groups training data into overlapping clusters, builds classifiers for each, and predicts test instances by leveraging the most similar training instances, improving classification stability.
Contribution
The paper presents a novel framework combining fuzzy clustering and individualized ensemble construction, enhancing classifier robustness and adaptability across diverse datasets.
Findings
ICE outperforms existing MCS methods on many benchmarks.
It demonstrates stable improvements across 49 datasets.
The approach is versatile and easily integrable with various models.
A Novel Multiple Classifier Generation and Combination Framework Based on Fuzzy Clustering and Individualized Ensemble Construction
Zhen Gao
Department of Computer Science
University of Texas at San Antonio
San Antonio, United States
Maryam Zand
Department of Computer Science
University of Texas at San Antonio
San Antonio, United States
Jianhua Ruan*
Department of Computer Science
University of Texas at San Antonio
San Antonio, United States
Abstract
Multiple classifier systems (MCSs) have become a successful alternative for improving classification performance. However, studies have shown inconsistent results for different MCSs, and it is often difficult to predict which MCS algorithm works best on a particular problem. We believe that the two crucial steps of an MCS, base classifier generation and multiple classifier combination, need to be designed in a coordinated way to produce robust results. In this work, we show that for different testing instances, better classifiers may be trained from different subdomains of training instances including, for example, neighboring instances of the testing instance, or even instances far away from the testing instance. To utilize this intuition, we propose Individualized Classifier Ensemble (ICE). ICE groups training data into overlapping clusters, builds a classifier for each cluster, and then associates each training instance with the top-performing models while taking into account model types and frequency. In testing, ICE finds the most similar training instances for a testing instance, then predicts the class label of the testing instance by averaging the predictions from the models associated with these training instances. Evaluation results on 49 benchmarks show that ICE achieves a stable improvement over existing MCS methods on a significant proportion of datasets. ICE provides a novel way of utilizing internal patterns among instances to improve classification, and can be easily combined with various classification models and applied to many application domains.
Index Terms:
Classification, Multiple classifier system, Ensemble Learning
I Introduction
Multiple classifier system (MCS), including ensemble classifiers and mixture of experts, has established itself as an effective and practical solution to address challenges in supervised learning, such as functional complexity, insufficient training data, high dimensionality of feature space, and noise in training data, among others. Many excellent comprehensive reviews on MCS algorithms are available [1, 2, 3].
Learning an MCS usually includes two critical steps: base classifier generation and multiple classifier combination, although sometimes the two steps are intrinsically integrated. Different MCS methods can be distinguished by how these two steps are performed. According to model generation strategies, existing MCS methods usually fall into one of the following two categories: random methods and deliberate methods. The former generate models by injecting random perturbations into the training data or training process [4, 5]. In contrast, the latter attempt to generate multiple classifiers in a more systematic, principled way, e.g., by iteratively re-weighting the training instances with emphasis on previously misclassified instances, a technique known as boosting [6], or by first clustering the training instances and then learning submodels from each cluster [7, 8]. According to model combination strategies, MCS methods can also be grouped into two categories: voting-based and learning-based. Most popular ensemble methods (e.g., bagging and boosting) take a (weighted) vote over all models in the pool. Other methods attempt to learn a high-level model in order to determine which model(s) should be selected for the prediction task, or to learn a more complex function to combine the outputs of all models in the pool. Learning-based model combination algorithms include stacking and dynamic model selection, among many others [9, 10].
Overall, ensemble approaches combining randomized model generation and voting (e.g., bagging and random forest) have been more successful and popular, probably due to their simplicity and lower risk of over-fitting. On the other hand, it has been shown that careful integration of deliberate models and learning-based model combination can be very effective on specific problem domains [11]. In particular, empirical studies suggest that many classification problems consist of subdomains, which can potentially benefit from constructing and selecting submodels [12, 7, 8]. The challenge, however, lies in whether these subdomains can be correctly identified at training time, and whether the submodels can be correctly selected for individual cases at prediction time.
Here, we design a general MCS framework, Individualized Classifier Ensemble (ICE), with two key ideas. First, it constructs a large pool of submodels that have low bias when applied to appropriate instances. This is achieved by applying a strong learner (in contrast to the high-bias, low-variance models commonly used in a few ensemble methods) to individual overlapping clusters of instances that represent possible subproblems. Second, a simple yet effective, learning-free method is used to obtain different combinations of submodels for different testing instances. The learning-free nature of the method reduces the chance of selecting wrong models, and is therefore intended to keep the combination of the selected submodels better than, or at least no worse than, an average of all submodels.
Experimental results on 49 datasets from different domains show that ICE consistently outperforms the competing methods. Furthermore, detailed component analysis shows that both steps of our algorithm have positive contributions as expected. In addition, analysis of the submodels can shed light on the internal structure of the problem, which can potentially be used to further increase prediction performance, or to improve mechanistic understanding of the problem. The framework can be easily combined with existing classifiers and applied to many domains.
II Methods
Fig. 1 shows a brief overview of ICE, which starts with generating a pool of diverse and subdomain-representative classifiers from subsets of training instances (Algorithm 1), obtained by a graph-based clustering method that can detect overlapping clusters (Algorithm 2). Then, these classifiers are associated with individual training instances based on their relative prediction performance on the instance, taking into account model types and frequency (Algorithm 3). In the testing/prediction stage, the nearest neighbors of a test instance are identified from the training dataset and the classifiers associated with these neighboring instances are selected to form an ensemble for prediction (Algorithm 4). While the general ICE framework is flexible and the individual components can be re-designed with domain-specific information, several design principles are crucial and are discussed below.
Source code and data are available at https://github.com/ds-utilities/ICE.
II-A Training
II-A1 Basic notations
We define a dataset of training instances as $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is an $m$-dimensional feature vector and $y_i$ is the binary label of instance $i$. The clustering result on $D$ is denoted as $C = \{C_1, \dots, C_K\}$; $C_k$ is the $k$th cluster; $K$ is the total number of clusters. Here we designate the last cluster, $C_K$, to be the whole set of instances. Without loss of generality, we assume the class labels are binary.
II-A2 Graph-based Fuzzy Clustering
As clustering can be subjective and unstable, we recommend generating a large number of relatively independent but overlapping clusters. In addition, each cluster needs to have a sufficient number of instances to learn a strong submodel for that subdomain. In our design, we use a graph-based clustering algorithm that chooses a set of furthest points to initiate a random walk process and uses probability cutoffs to control cluster size (Algorithm 2).
The algorithm works as follows. We first calculate an instance-instance distance matrix on $D$ by Euclidean distance and store it in $W$. Then, we construct a KNN graph by keeping the top $k$ neighbors for each node in $W$. Afterwards, a random walk with a restart probability $r$ (default to 0.3 in this work) is performed on the KNN graph to obtain an affinity matrix, $A$ [13]. Next, a set of $K-1$ points, $S$, is identified as cluster centers: from $A$, the node with the largest total incoming probability, $\sum_i A_{ij}$, is chosen as the center point of the first cluster; cluster centers for the other clusters are selected by finding the furthest node from the current center points. Finally, a probability cutoff is applied on $A$ to identify direct neighbors of each cluster center as members of the cluster, such that the average cluster size reaches a preset target (a fixed default is used in this work). We designate the last cluster, $C_K$, to be the whole set of instances. A classifier is built using instances from each cluster.
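The clustering steps described above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the released implementation: the function names, the fixed number of power iterations, the use of Euclidean distance for the "furthest node" step, and the single global cutoff chosen so that the average cluster size meets `target_size` are all assumptions made for the example.

```python
import math


def knn_transition(points, k):
    """Row-stochastic transition matrix of a directed KNN graph (Euclidean)."""
    n = len(points)
    trans = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        nearest = others[:k]
        for j in nearest:
            trans[i][j] = 1.0 / len(nearest)
    return trans


def random_walk_with_restart(trans, restart=0.3, iters=50):
    """aff[i][j]: probability of being at j for a walk that restarts at i."""
    n = len(trans)
    aff = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(iters):  # fixed-length power iteration (assumption)
        nxt = []
        for i in range(n):
            row = [0.0] * n
            for m in range(n):
                for j in range(n):
                    row[j] += (1.0 - restart) * aff[i][m] * trans[m][j]
            row[i] += restart
            nxt.append(row)
        aff = nxt
    return aff


def fuzzy_clusters(points, n_clusters, target_size, k=3, restart=0.3):
    """Overlapping clusters around far-apart centers, plus the whole set last."""
    n = len(points)
    aff = random_walk_with_restart(knn_transition(points, k), restart)
    # First center: node with the largest total incoming probability.
    incoming = [sum(aff[i][j] for i in range(n)) for j in range(n)]
    centers = [max(range(n), key=incoming.__getitem__)]
    # Remaining centers: furthest node from the current centers
    # (measured here by Euclidean distance, which is an assumption).
    while len(centers) < n_clusters:
        centers.append(max(
            (j for j in range(n) if j not in centers),
            key=lambda j: min(math.dist(points[j], points[c]) for c in centers)))
    # One global cutoff so that the average cluster size is about target_size.
    scores = sorted((aff[c][j] for c in centers for j in range(n)), reverse=True)
    cutoff = scores[min(len(scores), n_clusters * target_size) - 1]
    clusters = [[j for j in range(n) if aff[c][j] >= cutoff and aff[c][j] > 0]
                for c in centers]
    clusters.append(list(range(n)))  # last cluster = the whole training set
    return clusters
```

Because membership is decided by thresholding affinities rather than by assigning each instance to exactly one center, the same instance may appear in several clusters, which is the overlapping behaviour the method relies on.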
II-A3 Associating models to instances
Incorrect model selection can significantly degrade the performance of the algorithm compared to simply averaging all submodels. When the number of training instances is relatively small, supervised learning based model selection tends to overfit. Therefore, we propose a robust learning-free method (Algorithm 3), which performs model-instance association at training time and KNN-based model selection at prediction time. Importantly, the model-instance association step takes a Bayesian approach by using different cutoffs for different types of submodels, which reflects their frequency in the pool and the probability for them to outperform other types of submodels.
Formally, given the clustering result on $n$ instances, $C = \{C_1, \dots, C_K\}$, where $C_K$ is the whole set of instances, the corresponding set of models built on the clusters by a base learner (e.g., SVM) is denoted as $M = \{M_1, \dots, M_K\}$. Here we call a model $M_k$ ($k < K$) a ‘partial’ model, since each such model is built on a subset of the training instances, and we call model $M_K$ the ‘whole’ model, which is built on the whole set of instances. The class probabilities predicted by all models are stored in $P$; $P_{ik}$ is the predicted class probability for instance $i$ by model $M_k$; $P_{iK}$ is the prediction probability for instance $i$ by model $M_K$ built on the whole set of training instances. Note that if instance $i$ is NOT a member of cluster $C_k$ (in which case we call model $M_k$ a ‘remote’ model of instance $i$), the model $M_k$ is directly used to predict for instance $i$; on the other hand, if instance $i$ is a member of cluster $C_k$ (in which case we call model $M_k$ a ‘self’ model of instance $i$), the value $P_{ik}$ is obtained by 10-fold cross-validation using instances in this cluster. This process ensures that the performance evaluation used for model-instance association is not inflated, as an instance is never evaluated by a model that used the instance in training. Importantly, by not having any designated validation dataset, we are able to keep as many instances as possible for training, an important feature for small training data.
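The leakage-free construction of the prediction table can be illustrated as follows. This is a hedged sketch under stated assumptions: `fit` is a hypothetical callable that takes a list of training indices and returns a function mapping a feature vector to a class probability, and the interleaved fold assignment is chosen only for brevity.

```python
def prediction_table(X, clusters, fit, n_folds=10):
    """table[i][k]: probability for instance i from the model of cluster k.

    Non-members are scored directly by the cluster model; members are scored
    out-of-fold, so no instance is evaluated by a model that saw it in training.
    """
    n = len(X)
    table = [[None] * len(clusters) for _ in range(n)]
    for k, members in enumerate(clusters):
        member_set = set(members)
        model = fit(members)  # model trained on the full cluster
        for i in range(n):
            if i not in member_set:  # instance outside the cluster
                table[i][k] = model(X[i])
        for f in range(n_folds):  # cluster members: cross-validation
            held = set(members[f::n_folds])
            train = [i for i in members if i not in held]
            if not held or not train:
                continue  # cluster too small for this fold
            fold_model = fit(train)
            for i in held:
                table[i][k] = fold_model(X[i])
    return table
```

Passing the whole training set as the last cluster yields the cross-validated predictions of the model built on all instances in the last column.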
The prediction error table, $E$, is derived from $P$; $E_{ik}$ is the prediction error for instance $i$ by model $M_k$. Each row of $E$, $E_i$, represents the prediction error of different models on instance $i$. Given the empirical results that ‘self’ models usually work slightly better than the ‘whole’ model and ‘remote’ models, as well as the fact that there are more ‘remote’ models than ‘self’ models in the pool, we introduce two parameters to easily balance the proportion of ‘self’, ‘remote’ and ‘whole’ models in the ensemble: one as the advantage score of the ‘whole’ model, and the other as the advantage score of each ‘self’ model. Usually the two scores are chosen to promote the inclusion of ‘self’ models and demote ‘remote’ models, unless the error in a remote model is significantly smaller than in the ‘whole’ model. Each row of $E$ is adjusted accordingly: the ‘whole’ advantage score is applied to the entry $E_{iK}$, and the ‘self’ advantage score is applied to $E_{ik}$ if instance $i$ is a member of cluster $C_k$. Then, the decision table, $T$, where $T_{ik}$ indicates association between model $M_k$ and instance $i$, is derived from the error table $E$, by
[TABLE]
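One way to read the association step is sketched below. The concrete rule is an assumption made purely for illustration: each advantage score is subtracted from the corresponding error, and a cluster model is associated with an instance when its adjusted error is lower than the adjusted error of the model trained on all instances. The function name, argument layout and default scores are hypothetical.

```python
def associate_models(errors, is_member, whole_adv=0.4, self_adv=0.5):
    """Return a 0/1 table over the cluster models (all columns but the last).

    errors[i][k]    : prediction error of model k on instance i; the last
                      column belongs to the model trained on all instances.
    is_member[i][k] : True when instance i belongs to cluster k.
    ASSUMED rule: subtract the advantage scores, then keep a cluster model
    only if it beats the adjusted error of the model trained on all instances.
    """
    table = []
    for err_row, mem_row in zip(errors, is_member):
        whole_adjusted = err_row[-1] - whole_adv
        row = []
        for err, member in zip(err_row[:-1], mem_row):
            adjusted = err - self_adv if member else err
            row.append(int(adjusted < whole_adjusted))
        table.append(row)
    return table
```

Under this reading, a model of a cluster that contains the instance is kept even when it is slightly worse than the model trained on all instances, while a model of a cluster that does not contain the instance must beat that model by a clear margin, which matches the stated intent of the two scores.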
II-B Testing / prediction
For a test instance $x$, ICE first finds its $N$ nearest neighbors from the training dataset, then predicts its class label by averaging the class probabilities predicted by the models associated with the neighbor training instances (Algorithm 4). Formally, the PREDICT() algorithm first selects the $N$ nearest neighbors of $x$ from $D$, and stores the indices of the neighbor instances in $I$. Then, for each neighbor instance $i \in I$, the algorithm looks up the corresponding row of the decision table $T$ to find the models associated with the neighbor instance, and stores the associated ‘partial’ (cluster-level) models of $i$ in $Q$. The number of ‘partial’ models in $Q$ is denoted as $q$. Note that although every model in $Q$ belongs to $M$, $Q$ is not a subset of $M$, since $Q$ may contain duplicated models. Then we denote $\{p_1, \dots, p_q\}$ as the ‘partial’ model predictions, where each $p_j$ is predicted by the $j$th model in $Q$ on $x$. The predicted class probability by the ‘whole’ model is denoted as $p_w$. Then the predicted class probability of $x$ is calculated by:
[TABLE]
where the first weighting parameter balances the weight of ‘partial’ models and the ‘whole’ model; the second weighting parameter adjusts the weight of the ‘whole’ model based on the number of top neighbors, $N$, to ensure at least one model will be used in case there is no ‘partial’ model.
In our experiments, both weighting parameters are set to 1 and $N$ is set to 5, except in cases where we vary them to analyze the contribution of different components and the robustness of our algorithm’s performance.
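The prediction stage can be sketched as a neighbor lookup followed by an average. The sketch below is a simplification and not the exact weighting used by ICE: it collapses the two weighting parameters into a single hypothetical `whole_weight`, and it assumes models are plain callables that return a class probability.

```python
import math


def predict(x, train_X, decision, cluster_models, whole_model,
            n_neighbors=5, whole_weight=1.0):
    """Average the models associated with the nearest training instances.

    decision[i][k] == 1 means cluster model k is associated with training
    instance i. A model selected by several neighbors is counted several
    times, so the selection is a multiset rather than a set.
    """
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(x, train_X[i]))[:n_neighbors]
    selected = [k for i in nearest
                for k, flag in enumerate(decision[i]) if flag]
    preds = [cluster_models[k](x) for k in selected]
    if not preds:  # no associated cluster model: fall back to the full model
        return whole_model(x)
    # Simplified weighting (assumption): the full model counts as
    # `whole_weight` extra votes in the average.
    total = sum(preds) + whole_weight * whole_model(x)
    return total / (len(preds) + whole_weight)
```

Setting `whole_weight=0.0` removes the full model from the average whenever at least one cluster model has been selected.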
II-C Relationship with Existing MCS Methods
ICE differs from most existing ensemble methods significantly in both model generation and model combination. Popular ensemble methods such as Bagging and Random Forest generate submodels using random subsets of data, and combine them using voting. In order for these methods to work effectively, a large number of submodels is needed to reduce overall bias. In contrast, ICE generates submodels to deliberately increase model diversity by clustering training instances. A carefully designed model-instance association algorithm helps identify the best ensemble for individual instances at prediction time. On the other hand, boosting generates submodels that focus on different groups of training instances, where grouping of instances is done implicitly by iterative re-weighting and therefore lacks a global view of the instance space. In addition, since there is no model selection at prediction time, boosting tends to overfit in the presence of noisy training instances.
Mixtures of experts is a class of neural network models attempting to simultaneously learn multiple submodels as well as a gating function that assigns each instance to one or more submodels [14, 15]. With a similar idea, several methods use clustering as a preprocessing step for classification [7, 8]. These algorithms force each instance to be in a disjoint cluster, which reduces the number of instances available to each submodel at training time. In addition, prediction is done only by cluster-specific models, so the cost of incorrect model selection is high. Empirical results presented in the original papers show mixed performance when compared to other MCS algorithms [14, 7, 8].
Finally, a series of methods have been developed recently under the common name ‘dynamic model selection’ [16, 17, 18, 19, 20, 10, 21]. These approaches take an ensemble of base classifiers (e.g., from bagging), then attempt to learn a high-level classification model using, for example, instance-instance similarities and model-model correlations as input features. While conceptually appealing, these methods tend to overfit and have poor performance when training data is limited. In our opinion, the marriage between random model generation and learning-based model combination is a poor choice, since the relatively small number of random models (compared to the possible number of instance combinations) does not guarantee that there is necessarily any predictably better submodel than a simple average of all submodels.
III Results and Discussion
III-A Data and Experimental Setup
Characteristics of the 49 benchmark datasets are shown in Figure 2. The datasets are collected from the UCI machine learning repository and Kaggle Datasets for binary classification, with the number of instances between 100 and 3,000, the number of features between 3 and 1500, and the percentage of the majority class ranging from 50% to 77%. A total of 42 datasets from UCI and 23 datasets from Kaggle meet the criteria (18 of which appeared in both repositories). In addition, we add two cancer-related datasets - breast-cancer-nki [22] and breast-cancer-wang [23]. The data preprocessing mainly follows [24], which includes a $z$-score transformation-based normalization. For nine datasets with nominal features we use two different methods to handle nominal features: (i) removing nominal features (denoted with suffix ‘-1’ in Figure 2), (ii) using One-Hot encoding (denoted with suffix ‘-2’ in Figure 2). Since all features in dataset ‘tic-tac-toe’ are nominal, this dataset only has the One-Hot encoding version. Data and source code are available at https://github.com/ds-utilities/ICE.
Performance of each classification method is evaluated by 10-fold cross validation and measured by AUC. To facilitate a simple and fair evaluation, we use common parameter values for ICE on all datasets. The number of overlapping clusters is set to 100, which, while not ideal for all datasets, makes evaluation easier. The advantage scores for the ‘whole’ model and the ‘self’ model are each fixed across all datasets; this reflects the empirical observation that ‘self’ models usually have better performance than the other two types of models, and that there are many ‘remote’ models, so a higher cutoff score is needed for a ‘remote’ model to be associated with an instance. In the prediction stage, the number of top neighbors, $N$, is set to 5; the two weighting parameters are both set to 1 for an overall balance between ‘partial’ and ‘whole’ models in the final weighting of the prediction. The base model in the evaluation is a linear SVM with the regularization parameter $C$ = 1 for ICE and for the comparison methods Bagging and AdaBoost. Bagging and AdaBoost use 100 bags and 100 iterations, respectively. It is worth noting that these parameters are chosen intuitively without extensive tuning. Parameter analysis results show that the performance of ICE is robust over a wide range of parameter values.
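For readers who want to reproduce the evaluation protocol, the following standard-library sketch computes AUC from pairwise comparisons and pools out-of-fold scores over 10 folds. Pooling the scores before computing a single AUC (rather than averaging per-fold AUCs) is an assumption, as are the function names and the hypothetical `score_fold(train, test)` callback that returns one score per test index.

```python
import random


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n, n_folds=10, seed=0):
    """Yield (train, test) index lists that partition range(n) into folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    for f in range(n_folds):
        test = order[f::n_folds]
        held = set(test)
        yield [i for i in order if i not in held], test


def cross_validated_auc(labels, score_fold, n_folds=10, seed=0):
    """Collect out-of-fold scores for every instance, then compute one AUC."""
    scores = [None] * len(labels)
    for train, test in kfold_indices(len(labels), n_folds, seed):
        for i, s in zip(test, score_fold(train, test)):
            scores[i] = s
    return auc(labels, scores)
```

The pairwise AUC is quadratic in the number of instances, which is acceptable for datasets of the size used here (at most a few thousand instances).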
III-B Empirical Evidence Supporting Cluster-Based Ensemble Classification
To verify our assumption that, for each testing instance, some subset of training instances may provide a better classification model than the whole set of training instances, we perform a simple experiment as follows. First, each dataset is clustered into three disjoint clusters using k-means. We denote the clusters as cluster-a, b and c, in order of decreasing cluster size. Then we use the instances in each cluster for cross-testing: we compare the prediction AUC for each cluster using instances from cluster a, b, c or the whole dataset, respectively, as training data. We adopt the notation a-b to denote the situation where we use the model trained on cluster a to make predictions on cluster b instances.
To have a fair evaluation, when using a larger cluster to predict a smaller cluster, we randomly select the same number of instances from the larger cluster as the size of the smaller cluster to be the training data; when using a smaller cluster to predict a larger cluster, we use all instances in the smaller cluster and randomly select some instances from the larger cluster (making sure that they are not in the testing fold) to be the training instances, such that the total number of training instances is the same as the number of instances in the larger cluster.
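The size-matching rule above can be written as a small helper. This is an illustrative sketch; the function name and the handling of the within-cluster case (where the testing fold is simply removed from the training cluster) are assumptions.

```python
import random


def matched_training_set(source, target, test_fold, rng):
    """Training indices for 'train on source cluster, test on target cluster'.

    Larger source: subsample it down to the size of the target cluster.
    Smaller source: keep all of it and top up with target-cluster instances
    that are outside the current testing fold.
    """
    held_out = set(test_fold)
    source = [i for i in source if i not in held_out]
    if len(source) >= len(target):
        return rng.sample(source, len(target))
    in_source = set(source)
    pool = [i for i in target if i not in held_out and i not in in_source]
    extra = rng.sample(pool, min(len(pool), len(target) - len(source)))
    return source + extra
```

For example, `matched_training_set(cluster_c, cluster_a, fold, random.Random(0))` would build the training set for the c-a comparison on one testing fold of cluster a.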
From Figure 3, with only three disjoint clusters, in more than 80% of the datasets, at least one of the ‘self’ models can outperform the ‘whole’ model (Figure 3a and b, columns a-a, b-b and c-c). Interestingly, while in general the ‘remote’ models do not perform well, some of them have the largest performance gain compared to the ‘whole’ model (columns a-c and b-c). Collectively, this experiment shows the potential benefit of using a cluster of instances to improve prediction accuracy. On the other hand, the results also signify the importance of predicting, for each test instance, whether cluster-specific models (and which ones) should be used.
III-C ICE Outperforms Existing MCS Algorithms
Figure 4 shows that ICE outperforms the corresponding Bagging classifier on most datasets, and suffers from only minor performance loss on a few datasets. Notably, ICE uses fewer than 100 base models - on average 45 models per prediction. ICE may still have room for improvement on the datasets where it loses, through parameter tuning and improved clustering methods. Understandably, ICE tends to have less performance gain on datasets with fewer instances, such as datasets 1 to 6, since ICE needs richer instance information for a meaningful clustering. From another perspective, ICE will have an advantage on datasets with more instances and with more complex instance structure.
Table 1 shows the complete AUCs of three versions of ICE (with SVM, Bagging and AdaBoost as the base model) on 49 benchmark datasets compared to multiple MCS methods, including Bagging, AdaBoost, and seven dynamic model selection approaches. META-DES [16] has two versions in this evaluation, using Perceptron (the base classifier choice of the original META-DES paper) and Bagging (comparable with Bagging and ICE-Bagging), respectively. The base classifier is Bagging for the other six dynamic model selection methods - KNORA-U [17], KNORA-E [17], DES-PRC [18, 19], OLA [20], MCB [10] and A Priori [21], which is the suggested setting, combined with SVM to make them comparable with the other methods. We use the suggested parameters for the dynamic model selection approaches [25].
As shown, all three versions of ICE have better performance than the other methods. The performance gain of ICE over Bagging can be attributed to the use of specifically generated models for subproblems and to the individualized model association and selection step. Comparing AdaBoost to ICE, both models attempt to produce subdomain-specific classifiers; however, AdaBoost always uses the same ensemble of all submodels for all instances, which reduces the potential performance gain provided by the subdomain-specific models. Therefore, ICE-AdaBoost and even ICE-SVM perform better than AdaBoost in general. Moreover, ICE outperforms the seven dynamic selection methods. Each of the dynamic selection methods has unique contributions on model selection or integration. However, none of them focuses on deliberately generating models for specific subproblems, as the fuzzy clustering used by ICE does. In addition, the unique instance-model association of ICE can utilize all training instances, compared to dynamic selection methods such as META-DES, which separates training data into meta-learning and dynamic selection datasets and therefore leads to more data loss and weaker base classifiers. As discussed earlier in Section II-C, learning the best combination of multiple randomly generated models can be a daunting task when the amount of training data is limited.
III-D Randomized Control Analysis Reveals The Effectiveness of Different Components of ICE
To understand the impact of the three components of ICE (C1: fuzzy clustering based model generation, C2: instance-model association, and C3: KNN-based model selection), we perform a randomized control experiment, where one or more of the components is replaced with comparable, randomized procedures. To randomize C1, the fuzzy clustering is replaced by bootstrapping instances, where the bags are made the same size as the fuzzy clusters, therefore resulting in a slightly modified version of Bagging. To randomize C2, the decision table is shuffled row-wise, destroying the association of models to instances. Finally, to randomize C3, KNN is replaced with random selection of instances. Note that randomizing C2 or C3 (or both) is expected to have a similar impact on the algorithm, which will essentially perform random model selection (and in most cases will choose many more models than real ICE due to independence of different rows of the randomized decision table).
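The three randomized controls can be sketched as simple drop-in replacements. The helper names are hypothetical, and reading "shuffled row-wise" as a permutation of which row belongs to which training instance is an assumption.

```python
import random


def randomize_c1(clusters, n_instances, rng):
    """C1 control: bootstrap bags with the same sizes as the fuzzy clusters."""
    return [[rng.randrange(n_instances) for _ in cluster]
            for cluster in clusters]


def randomize_c2(decision, rng):
    """C2 control: permute the rows so that each instance receives the
    model associations of some other, randomly chosen instance."""
    rows = [row[:] for row in decision]
    rng.shuffle(rows)
    return rows


def randomize_c3(n_train, n_neighbors, rng):
    """C3 control: pick 'neighbors' uniformly at random instead of by KNN."""
    return rng.sample(range(n_train), n_neighbors)
```

Each control keeps the sizes involved unchanged (bag sizes, number of associations overall, number of neighbors), so that differences in performance can be attributed to the structure that was destroyed rather than to a change in ensemble size at training time.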
Figure 5 shows the performance of ICE with different components randomized. Here, in order to show the effectiveness of each component of ICE, the two weighting parameters are set to 0, effectively eliminating the ‘whole’ model. Not surprisingly, when both the model generation and model selection components of ICE are randomized (columns 1-3 in Figure 5a), its performance becomes similar to that of Bagging. On the other hand, when only one component is randomized (columns 4-7), ICE can still perform better than standard Bagging, although not as effectively as the complete ICE algorithm (column 8), indicating that both components of ICE play a role in effective learning.
Interestingly, with only C1 randomized, our algorithm is conceptually similar to dynamic model selection [16], except that we replace their learning-based model selection with simple KNN-based model selection. The fact that this version of ICE still outperforms dynamic model selection suggests that, with limited training data, KNN-based model selection can have more robust performance than learning-based model selection. In addition, when C2 or C3 (or both) are randomized but C1 is not randomized (columns 5-7), our algorithm is conceptually similar to bagging, except that the models in the ensemble are based on clusters of instances instead of random selection of instances. As shown, this version of ICE has a significant performance gain over Bagging, suggesting that, at least in these datasets, clustering-based model generation, which implicitly diversifies the models, can be better than randomized model generation.
III-E Performance of ICE is Robust in a Wide Range of Parameter Space
Figure 6 shows the results of ICE using a wide range of parameters: $N$, the number of neighbors per testing instance in prediction; the advantage score of the base ‘whole’ model in model-instance association; and the advantage score of the ‘self’ model in model-instance association. In this analysis, the two weighting parameters are both set to 1 to balance ‘partial’ and ‘whole’ models.
Figure 6a shows that the number of nearest neighbors used in model selection has only a slight impact on AUC gain on average across all 49 datasets. The recommended setting of $N$ is 5 to 10 for a balance between running speed and accuracy. ICE works best when there are strong patterns in the dataset. If ICE does not have a significant gain over Random Forest (RF) on a certain dataset, a larger $N$ will make ICE more stable and closer to bagging. ICE still has considerable room for improvement on specific datasets by using a more suitable fuzzy clustering algorithm, which is part of our future work.
Figure 6b and Figure 6c show the robust performance of ICE with respect to the two advantage scores. A general guideline is to set one advantage score slightly larger than the other, such as 0.5 and 0.4. The two weighting parameters are quite simple to choose. Setting both to 1 leads to a decent result in most cases; try setting both to 0 if there are strong clusters within the dataset, in which case the extremely localized classifiers may have an advantage over the basic default choice in which both are set to 1.
In addition, it is worth noting that the parameters used in the experimental setup have not been tuned for individual datasets in this study. Tuning the model on each dataset could potentially improve performance further.
III-F ICE Significantly Improves Random Forest Performance
We further perform an extreme comparison between ICE (using Random Forest with 100 trees as the base classifier) and Random Forest with 10,000 trees. Random Forest (RF) is well known for its stable high performance with an almost tuning-free design, and is well positioned to be a benchmark classifier. As shown in Figure 7, ICE significantly improves the performance of Random Forest; ICE wins or ties over RF on 36 out of 49 datasets (about 73%), and has minor performance loss on 13 datasets. The $t$-test $p$-value of the gain is 0.018, which is statistically significant, and there are 7 datasets (highlighted in Figure 7) with AUC gain over 8% (among these, ICE has AUC gain over 13.4% on 4 datasets), while no dataset has an AUC loss over 3%. Note that ICE only uses on average 47 models per prediction, far fewer than the 10,000 trees used by the RF classifier. Moreover, RF easily reaches its performance limits as the number of trees grows, while ICE has much more room for improvement as the number of submodels increases. Performance of ICE can be further improved by increasing the number of fuzzy clusters (submodels) or using more suitable clustering methods.
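A per-dataset comparison of this kind can be summarized with a win/tie/loss count and a paired test statistic on the AUC gains. The sketch below assumes a paired $t$-test on per-dataset gains and only computes the statistic (the standard library has no $t$ distribution, so the $p$-value would come from a statistics package or a table with $n-1$ degrees of freedom). The function names and the tie tolerance are hypothetical.

```python
import math
import statistics


def win_tie_loss(auc_a, auc_b, tol=1e-9):
    """Count datasets where method A beats, ties with, or loses to method B."""
    gains = [a - b for a, b in zip(auc_a, auc_b)]
    wins = sum(g > tol for g in gains)
    losses = sum(g < -tol for g in gains)
    return wins, len(gains) - wins - losses, losses


def paired_t_statistic(auc_a, auc_b):
    """t statistic of the mean per-dataset gain, with n - 1 degrees of freedom."""
    gains = [a - b for a, b in zip(auc_a, auc_b)]
    standard_error = statistics.stdev(gains) / math.sqrt(len(gains))
    return statistics.mean(gains) / standard_error
```

Here `auc_a` and `auc_b` would hold one AUC per dataset for the two methods being compared, in the same dataset order.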
The performance gain of ICE over RF can be attributed to the use of specifically generated models for subproblems and to the individualized model association and selection step. Interestingly, the AUC gain of ICE is correlated with the result from Figure 3a - the 7 highest-scoring datasets by ICE in Figure 7 have on average 4.3 ‘partial’ models outperforming the ‘whole’ model, while this statistic is only 2.6 for the other datasets; the average AUC gain by ‘partial’ models of ICE on these 7 datasets is 0.077, while it is 0.057 for the other datasets. This result not only further validates our intuition of using ‘partial’ models to improve classification performance, but also suggests that the performance of ICE can be partially predicted from dataset characteristics, which is a very important feature in practice.
III-G Classification Improved by Accurately Predicting ‘Hard’ Instances with ‘Partial’ Models
ICE has a stable AUC gain on most datasets over a large range of parameter variation, and the dataset with one of the most dramatic improvements using ICE is the 15-breast-cancer-1 dataset. As shown in Figure 8a, as the parameter $N$ increases, the average number of models per instance also increases and the performance of ICE continues to increase, eventually reaching a plateau. In addition, analysis of the models used by each test instance of ICE shows an interesting bimodal distribution: most of the test instances (262 out of 286 cases) use fewer than 20 models (mostly ‘self’ models); in contrast, a few instances (24 cases) use more than 40 unique models (including both ‘self’ and ‘remote’ models) (Figure 8b), which are presumably the more difficult instances that are hard to cluster and/or classify.
Comparing the performance difference on these two groups of instances, we can see that ICE has a much lower prediction error when compared to the‘’ model on instances with 30 models by ICE (Figure 8c). The test -value of the error differences between the‘’ model prediction and ICE prediction on instances with 30 models (24 cases) is significant (). This result demonstrates that different instances should be treated differently on this dataset, and the ICE algorithm shows a potential way of separating and treating these different instances.
III-H AUC Gain of ICE has a Strong Correlation with the Data-Decision Table Similarity
In this work, instances are clustered based on their similarities in the feature space. However, it is possible that this clustering may not be optimal in revealing model heterogeneity. A different view may be obtained by analyzing the instance-instance similarities in the model space. Therefore, we use the decision table, which describes the prediction performance of each model on each instance, to measure instance-instance similarity, and inspect whether the consistency between these two types of similarity measures can be predictive of the performance of ICE.
Indeed, as shown in Figure 9, there exists a strong positive correlation (Pearson correlation coefficient = 0.425) between AUC gain and feature-model consistency, where the consistency is defined as the Pearson correlation coefficient between the instance-instance similarities measured in the feature space and the instance-instance similarities measured using the decision table entries. This result indicates the potential of improving our current work through feature selection and better clustering methods on the data. Our intuition is that some features are more relevant to a classification task than others, and that clustering for the classification task should use these features rather than all features. This also explains why the AUC gain on dataset 15-breast-cancer-1 (3 features, AUC gain = 0.126) is much larger than the AUC gain on dataset 16-breast-cancer-2 (50 features, AUC gain = 0.024). The three features of dataset 15-breast-cancer-1 are ‘tumor-size’, ‘left or right breast’ and ‘if irradiate’; all the non-binary nominal features in the original breast cancer dataset from [24] have been removed, while dataset 16-breast-cancer-2 keeps all the other nominal features by One-Hot encoding. It is reasonable to imagine that the clustering on dataset 16 is influenced by some of the over-complicated and irrelevant features (for the classification task); therefore, the models built on those clusters are not optimized for the classification task. A potential future improvement is to cluster instances based on the output values from different models instead of on the feature values, or to use both in an iterative manner.
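The feature-model consistency defined above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the text does not specify the similarity measures, so negative Euclidean distance is assumed for the feature space and the fraction of matching entries is assumed for decision-table rows. All function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def feature_similarity(u, v):
    # Assumed measure: negative Euclidean distance (larger = more similar).
    return -math.dist(u, v)


def decision_similarity(r, s):
    # Assumed measure: fraction of models with identical decision-table entries.
    return sum(a == b for a, b in zip(r, s)) / len(r)


def feature_model_consistency(features, decision_table):
    """Correlate pairwise similarities in feature space and in model space."""
    pairs = list(combinations(range(len(features)), 2))
    feat = [feature_similarity(features[i], features[j]) for i, j in pairs]
    dec = [decision_similarity(decision_table[i], decision_table[j])
           for i, j in pairs]
    return pearson(feat, dec)


# Toy data: 4 instances with 2 features each, and 3 models.
# decision_table[i][m] is 1 if model m predicts instance i correctly, else 0.
features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
decision_table = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
consistency = feature_model_consistency(features, decision_table)
print(round(consistency, 3))
```

In this toy example, instances that are close in feature space also tend to be predicted correctly by the same models, so the consistency is positive. On a real dataset, the decision table would come from evaluating each submodel on each instance.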
IV Conclusion
Based on the intuition that classifiers generated from different subdomains of training instances are needed in classification tasks, we proposed ICE, a novel multiple classifier generation and combination framework, which generally increases the diversity among submodels and successfully associates the submodels with subdomains of instances. Evaluation results on 49 benchmarks show that our model achieves a stable improvement over multiple existing MCS methods on a significant proportion of datasets. A detailed component analysis shows that the different components of our algorithm work coordinately to achieve its performance. We believe that ICE can provide a novel choice for utilizing subdomain models to improve classification.
Acknowledgment
This research was supported in part by grants from the National Science Foundation (award numbers IIS-1218201 and ABI-1565076) and the National Institutes of Health (award numbers G12MD007591 and U54CA217297).
References
- [1] L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence Review, vol. 33, no. 1-2, pp. 1–39, 2010.
- [2] L. I. Kuncheva, Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.
- [3] N. C. Oza and K. Tumer, “Classifier ensembles: Select real-world applications,” Information Fusion, vol. 9, no. 1, pp. 4–20, 2008.
- [4] L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.
- [5] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
- [6] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in European conference on computational learning theory. Springer, 1995, pp. 23–37.
- [7] R. Vilalta, M.-K. Achari, and C. F. Eick, “Class decomposition via clustering: a new framework for low-variance classifiers,” in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE, 2003, pp. 673–676.
- [8] L. I. Kuncheva, “Clustering-and-selection model for classifier combination,” in Knowledge-Based Intelligent Engineering Systems and Allied Technologies, 2000. Proceedings. Fourth International Conference on, vol. 1. IEEE, 2000, pp. 185–188.
