A Novel Multiple Classifier Generation and Combination Framework Based on Fuzzy Clustering and Individualized Ensemble Construction

Zhen Gao; Maryam Zand; Jianhua Ruan

arXiv:1907.13353·cs.LG·August 1, 2019

TL;DR

This paper introduces ICE, a new individualized ensemble method that groups training data into overlapping clusters, builds classifiers for each, and predicts test instances by leveraging the most similar training instances, improving classification stability.

Contribution

The paper presents a novel framework combining fuzzy clustering and individualized ensemble construction, enhancing classifier robustness and adaptability across diverse datasets.

Findings

01. ICE outperforms existing MCS methods on many benchmarks.

02. It demonstrates stable improvements across 49 datasets.

03. The approach is versatile and easily integrable with various models.

Equations

$$d_{ij}=\begin{cases}1, & \text{if } e_{ij}\leq e_{iL}\\ 0, & \text{otherwise}\end{cases}$$

$$p^{t}=\frac{\sum_{i=1}^{M}p^{partial}_{i}+(\alpha M+\beta N)\cdot p^{whole}}{(\alpha+1)M+\beta N}$$

Taxonomy

Topics: Face and Expression Recognition · Machine Learning and Data Classification · Text and Document Classification Technologies

Full text

A Novel Multiple Classifier Generation and Combination Framework Based on Fuzzy Clustering and Individualized Ensemble Construction

Zhen Gao

Department of Computer Science

University of Texas at San Antonio

San Antonio, United States

[email protected]

Maryam Zand

Department of Computer Science

University of Texas at San Antonio

San Antonio, United States

[email protected]

Jianhua Ruan*

Department of Computer Science

University of Texas at San Antonio

San Antonio, United States

[email protected]

Abstract

Multiple classifier systems (MCSs) have become a successful alternative for improving classification performance. However, studies have shown inconsistent results for different MCSs, and it is often difficult to predict which MCS algorithm works best on a particular problem. We believe that the two crucial steps of an MCS - base classifier generation and multiple classifier combination - need to be designed in a coordinated way to produce robust results. In this work, we show that for different testing instances, better classifiers may be trained from different subdomains of training instances including, for example, neighboring instances of the testing instance, or even instances far away from the testing instance. To utilize this intuition, we propose Individualized Classifier Ensemble (ICE). ICE groups training data into overlapping clusters, builds a classifier for each cluster, and then associates each training instance with the top-performing models while taking into account model types and frequency. In testing, ICE finds the $k$ most similar training instances for a testing instance, then predicts the class label of the testing instance by averaging the predictions from the models associated with these training instances. Evaluation results on 49 benchmarks show that ICE has a stable improvement on a significant proportion of datasets over existing MCS methods. ICE provides a novel way of utilizing internal patterns among instances to improve classification, and can be easily combined with various classification models and applied to many application domains.

Index Terms:

Classification, Multiple classifier system, Ensemble Learning

I Introduction

Multiple classifier system (MCS), including ensemble classifiers and mixture of experts, has established itself as an effective and practical solution to address challenges in supervised learning, such as functional complexity, insufficient training data, high dimensionality of feature space, and noise in training data, among others. Many excellent comprehensive reviews on MCS algorithms are available [1, 2, 3].

Learning an MCS usually includes two critical steps: base classifier generation and multiple classifier combination, although sometimes the two steps are intrinsically integrated. Different MCS methods can be distinguished by how these two steps are performed. According to model generation strategies, existing MCS methods usually fall into one of two categories: random methods and deliberate methods. The former generate models by injecting random perturbations into the training data or training process [4, 5]. In contrast, the latter attempt to generate multiple classifiers in a more systematic, principled way, e.g., by iteratively re-weighting the training instances with emphasis on previously misclassified instances, a technique known as boosting [6], or by first clustering the training instances and then learning submodels from each cluster [7, 8]. According to model combination strategies, MCS methods can also be grouped into two categories: voting-based and learning-based. Most popular ensemble methods (e.g., bagging and boosting) take a (weighted) vote from all models in the pool. Other methods attempt to learn a high-level model in order to determine which model(s) should be selected for the prediction task, or to learn a more complex function to combine the outputs of all models in the pool. Learning-based model combination algorithms include stacking and dynamic model selection, among many others [9, 10].

Overall, ensemble approaches combining randomized model generation and voting (e.g., bagging and random forest) have been more successful and popular, probably due to their simplicity and lower tendency to overfit. On the other hand, it has been shown that careful integration of deliberate models and learning-based model combination can be very effective on specific problem domains [11]. In particular, empirical studies suggest that many classification problems consist of subdomains, which can potentially benefit from constructing and selecting submodels [12, 7, 8]. The challenge, however, lies in whether these subdomains can be correctly identified at training time, and whether the submodels can be correctly selected for individual cases at prediction time.

Here, we design a general MCS framework, Individualized Classifier Ensemble (ICE), with two key ideas. First, it constructs a large pool of submodels that have low bias when applied to appropriate instances. This is achieved by applying a strong learner (in contrast to the high-bias, low-variance models commonly used in a few ensemble methods) to individual overlapping clusters of instances that represent possible subproblems. Second, a simple yet effective, learning-free method is used to obtain different combinations of submodels for different testing instances. The learning-free nature of the method reduces the chance of selecting wrong models, and is therefore intended to keep the combination of the selected submodels better than, or at least no worse than, an average of all submodels.

Experimental results on 49 datasets from different domains show that ICE consistently outperforms the competing methods. Furthermore, detailed component analysis shows that both steps of our algorithm have positive contributions as expected. In addition, analysis of the submodels can shed light on the internal structure of the problem, which can potentially be used to further increase prediction performance, or to improve mechanistic understanding of the problem. The framework can be easily combined with existing classifiers and applied to many domains.

II Methods

Fig. 1 shows a brief overview of ICE, which starts by generating a pool of diverse and subdomain-representative classifiers from subsets of training instances (Algorithm 1), obtained by a graph-based clustering method that can detect overlapping clusters (Algorithm 2). Then, these classifiers are associated with individual training instances based on their relative prediction performance on the instance, taking into account model types and frequency (Algorithm 3). In the testing/prediction stage, the nearest neighbors of a test instance are identified from the training dataset and the classifiers associated with these neighboring instances are selected to form an ensemble for prediction (Algorithm 4). While the general ICE framework is flexible and the individual components can be re-designed with domain-specific information, several design principles are crucial and are discussed below.

Source code and data are available at https://github.com/ds-utilities/ICE.

II-A Training

II-A1 Basic notations

We define a dataset of $Q$ training instances as $\bm{A}=\{(x_{i},y_{i})\}_{i=1}^{Q}$, where $x_{i}\in\bm{X}$ is an $R$-dimensional feature vector and $y_{i}\in\bm{Y}$ is the label of instance $i$. Without loss of generality, we assume the class labels are binary. The clustering result on $\bm{X}$ is denoted as $\bm{C}=\{c_{i}\}_{i=1}^{L}$, where $c_{i}$ is the $i$-th cluster and $L$ is the total number of clusters. We designate the last cluster $c_{L}$ of $\bm{C}$ to be the whole set of instances.

II-A2 Graph-based Fuzzy Clustering

As clustering can be subjective and unstable, we recommend generating a large number of relatively independent but overlapping clusters. In addition, each cluster needs to have a sufficient number of instances to learn a strong submodel for that subdomain. In our design, we use a graph-based clustering algorithm that chooses a set of furthest points to initiate a random walk process and use probability cutoffs to control cluster size (Algorithm 2).

The algorithm works as follows. We first calculate an instance-instance distance matrix on $\bm{X}$ using Euclidean distance and store it in $\bm{S}$. Then, we construct a KNN graph $\bm{G}$ by keeping the top $\lceil\log_{10}Q\rceil$ neighbors for each node in $\bm{S}$. Afterwards, a random walk with a restart probability $p$ (0.3 by default in this work) is performed on the KNN graph $\bm{G}$ to obtain an affinity matrix, $\bm{W}$ [13]. Next, a set of points, $\bm{T}=\left\langle t_{j}\right\rangle_{(L-1)\times 1}$, is identified as cluster centers: from $\bm{W}$, the node with the largest total incoming probability, $t_{1}$, is chosen as the center point of the first cluster; centers for the other clusters are selected one at a time by finding the node furthest from the current center points. Finally, a probability cutoff is applied on $\bm{W}$ to identify direct neighbors of each cluster center as members of the cluster, such that the average cluster size is $z$ ($z=Q/3$ by default). As noted above, the last cluster $c_{L}$ of $\bm{C}$ is the whole set of instances. A classifier is built using the instances from each cluster.
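The clustering steps above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation (the repository linked above holds that): the KNN graph is symmetrized, the random walk with restart is approximated by a fixed number of power iterations, "furthest" is read as lowest affinity to the centers already chosen, and a single global cutoff is used so that the average partial-cluster size is roughly `avg_size`. All function and parameter names are hypothetical.

```python
import math

def knn_transition(points, k):
    """Row-normalized transition matrix of a symmetrized Euclidean KNN graph."""
    n = len(points)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            adj[i][j] = adj[j][i] = 1.0  # symmetrizing is an assumption
    trans = []
    for row in adj:
        total = sum(row)
        trans.append([v / total for v in row])
    return trans

def rwr_affinity(trans, restart=0.3, iters=50):
    """W[i][j]: approximate probability of being at j for a walk restarting at i."""
    n = len(trans)
    W = []
    for seed in range(n):
        r = [1.0 if j == seed else 0.0 for j in range(n)]
        for _ in range(iters):
            nxt = [0.0] * n
            for u in range(n):
                if r[u]:
                    for v in range(n):
                        if trans[u][v]:
                            nxt[v] += (1 - restart) * r[u] * trans[u][v]
            nxt[seed] += restart
            r = nxt
        W.append(r)
    return W

def fuzzy_clusters(points, n_centers, avg_size, restart=0.3):
    """Return n_centers overlapping clusters plus the whole set as the last one."""
    n = len(points)
    k = max(1, math.ceil(math.log10(n)))
    W = rwr_affinity(knn_transition(points, k), restart)
    incoming = [sum(W[i][j] for i in range(n)) for j in range(n)]
    centers = [max(range(n), key=incoming.__getitem__)]
    while len(centers) < n_centers:
        # "furthest" node: lowest best-affinity to any chosen center (assumption)
        rest = [j for j in range(n) if j not in centers]
        centers.append(min(rest, key=lambda j: max(W[c][j] for c in centers)))
    # One global cutoff chosen so that about n_centers * avg_size memberships survive.
    scores = sorted((W[c][j] for c in centers for j in range(n)), reverse=True)
    cutoff = scores[min(len(scores), n_centers * avg_size) - 1]
    clusters = [[j for j in range(n) if W[c][j] >= cutoff and W[c][j] > 0]
                for c in centers]
    clusters.append(list(range(n)))  # last cluster c_L is the whole training set
    return clusters
```

Because membership is decided per center against a shared cutoff, an instance can belong to several clusters or to none of the partial ones, which is the overlapping behavior the text describes.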

II-A3 Associating models to instances

Incorrect model selection can significantly degrade the performance of the algorithm compared to simply averaging all submodels. When the number of training instances is relatively small, supervised learning based model selection tends to overfit. Therefore, we propose a robust learning-free method (Algorithm 3), which performs model-instance association at training time and KNN-based model selection at prediction time. Importantly, the model-instance association step takes a Bayesian approach by using different cutoffs for different types of submodels, which reflects their frequency in the pool and the probability for them to outperform other types of submodels.

Formally, given the clustering result on instances, $\bm{C}=\{c_{i}\}_{i=1}^{L}$, where $c_{L}$ is the whole set of instances, the corresponding set of models built on the clusters by a base learner (e.g., SVM) is denoted as $\bm{O}=\{o_{i}\}_{i=1}^{L}$. We call a model $o_{i}$, $i\in[1,L-1]$, a ‘ $partial$ ’ model, since each such model is built on a subset of the training instances, and we call model $o_{L}$ the ‘ $whole$ ’ model, since it is built on the whole set of instances. The class probabilities predicted by all models are stored in $\bm{P}=\left\langle p_{ij}\right\rangle_{Q\times L}$, where $p_{ij}$ is the predicted class probability for instance $i$ by model $j$, and $p_{iL}$ is the predicted probability for instance $i$ by the model built on the whole set of training instances. Note that if instance $i$ is NOT a member of cluster $c_{j}$ (in which case we call model $o_{j}$ a ‘ $remote$ ’ model of instance $i$), the model is directly used to predict $p_{ij}$ for instance $i$; on the other hand, if instance $i$ is a member of cluster $c_{j}$ (in which case we call model $o_{j}$ a ‘ $local$ ’ model of instance $i$), the value $p_{ij}$ is obtained by 10-fold cross-validation using the instances in this cluster. This process ensures that the performance evaluation used for model-instance association is not inflated, as an instance is never evaluated by a model that used the instance in training. Importantly, by not having any designated validation dataset, we are able to keep as many instances as possible for training, an important feature for small training data.
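The way $\bm{P}$ is filled can be illustrated with a short sketch. It assumes a stand-in 1-nearest-neighbor base learner (`fit_1nn`, hypothetical, chosen only because it memorizes its training data and so makes the leakage issue visible; the paper uses learners such as a linear SVM) and assigns folds by simple striding rather than by shuffled or stratified splits. Clusters with a single member are left unfilled in this sketch.

```python
import math

def fit_1nn(X, y):
    """Stand-in base learner: predicts the label of the nearest training point."""
    def predict(x):
        j = min(range(len(X)), key=lambda t: math.dist(X[t], x))
        return float(y[j])
    return predict

def prediction_table(X, y, clusters, fit=fit_1nn, folds=10):
    """P[i][j]: prediction for instance i by the model of cluster j."""
    Q, L = len(X), len(clusters)
    P = [[None] * L for _ in range(Q)]
    for j, members in enumerate(clusters):
        inside = set(members)
        model = fit([X[i] for i in members], [y[i] for i in members])
        for i in range(Q):
            if i not in inside:  # remote model: predict directly
                P[i][j] = model(X[i])
        for f in range(folds):   # local model: out-of-fold prediction
            held = members[f::folds]
            train = [i for i in members if i not in held]
            if not held or not train:
                continue
            fold_model = fit([X[i] for i in train], [y[i] for i in train])
            for i in held:
                P[i][j] = fold_model(X[i])
    return P

X = [(0,), (1,), (2,), (3,), (10,), (11,)]
y = [0, 1, 0, 1, 1, 1]
clusters = [[0, 1, 2, 3], [4, 5], [0, 1, 2, 3, 4, 5]]
P = prediction_table(X, y, clusters)
```

With this toy data, an in-sample 1-NN model would predict instance 0 perfectly, whereas the out-of-fold value `P[0][0]` comes from a model that never saw instance 0, which is the inflation the text warns about.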

The prediction error table, $\bm{E}=\left\langle e_{ij}\right\rangle_{Q\times L}$, is derived from $\bm{P}$; $e_{ij}=\lvert p_{ij}-y_{i}\rvert$ is the prediction error for instance $i$ by model $o_{j}$. Each row of $\bm{E}$, $e_{i\bullet}$, represents the prediction errors of the different models on instance $i$. Given the empirical result that $local$ models usually work slightly better than the $whole$ model and $remote$ models, as well as the fact that there are more $remote$ models than $local$ models in the pool, we introduce two parameters to balance the proportion of $local$, $whole$ and $remote$ models in the ensemble: $w$, the advantage score of the $whole$ model, and $s$, the advantage score of each $local$ model. Usually $s>w>0$, which promotes the inclusion of $local$ models and demotes $remote$ models unless the error of a $remote$ model is significantly smaller than that of the $whole$ model. Each row of $\bm{E}$ is adjusted such that $e_{iL}\leftarrow(e_{iL}-w)$, and $e_{ij}\leftarrow(e_{ij}-s)$ if $x_{i}\in c_{j}$. Then, the decision table, $\bm{D}=\left\langle d_{ij}\right\rangle_{Q\times L}$, $d_{ij}\in\{1,0\}$, where $d_{ij}=1$ indicates an association between model $o_{j}$ and instance $i$, is derived from the adjusted error table $\bm{E}$ by

$$d_{ij}=\begin{cases}1, & \text{if } e_{ij}\leq e_{iL}\\ 0, & \text{otherwise}\end{cases}$$
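A small sketch of this association step follows. It assumes the $local$ advantage $s$ is applied only to the partial models (the $whole$ model receives $w$ only, even though every instance belongs to $c_{L}$), which is one reading of the text; the function name and argument layout are hypothetical.

```python
def decision_table(P, y, clusters, w=0.4, s=0.5):
    """D[i][j] = 1 if model j is associated with training instance i."""
    Q, L = len(P), len(clusters)
    member = [set(c) for c in clusters]
    D = []
    for i in range(Q):
        errors = [abs(P[i][j] - y[i]) for j in range(L)]
        whole = errors[L - 1] - w  # adjusted error of the whole model
        row = []
        for j in range(L - 1):
            # local models get the advantage s, remote models get none
            adjusted = errors[j] - s if i in member[j] else errors[j]
            row.append(1 if adjusted <= whole else 0)
        row.append(1)  # the whole model always satisfies e_iL <= e_iL
        D.append(row)
    return D

# Two instances, two partial clusters, and the whole set as the last cluster.
P = [[0.7, 0.95, 0.75],
     [0.1, 0.60, 0.90]]
y = [1, 0]
clusters = [[0], [1], [0, 1]]
D = decision_table(P, y, clusters)
```

For instance 0, the remote model has the smallest raw error (0.05) but is still rejected, because it does not beat the whole model by at least $w=0.4$; the local model is kept thanks to its advantage $s=0.5$.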

II-B Testing / prediction

For a test instance $x_{t}$, ICE first finds its $N$ nearest neighbors from the training dataset, then predicts its class label $y_{t}$ by averaging the class probabilities predicted by the models associated with the neighboring training instances (Algorithm 4). Formally, the PREDICT() algorithm first selects the $N$ nearest neighbors of $x_{t}$ from $\bm{X}$, and stores the indices of the neighbor instances in $K^{nb}=\left\langle k_{i}\right\rangle_{N\times 1}$. Then, for each neighbor instance $k_{i}$, the algorithm looks up the corresponding row of the decision table, $d_{k_{i}\bullet}$, to find the models associated with the neighbor instance, and stores the associated ‘ $partial$ ’ models in $\bm{O}^{nb}$. The number of ‘ $partial$ ’ models in $\bm{O}^{nb}$ is denoted as $M$. Note that although each $o^{nb}_{i}\in\bm{O}$, $\bm{O}^{nb}$ is not a subset of $\bm{O}$, since $\bm{O}^{nb}$ may contain duplicated models. We denote the ‘ $partial$ ’ model predictions as $\bm{P}^{partial}=\left\langle p^{partial}_{i}\right\rangle_{M\times 1}$, where each $p^{partial}_{i}$ is predicted by $o^{nb}_{i}$ on $x_{t}$. The class probability predicted by the whole model is denoted as $p^{whole}$. The predicted class probability of $x_{t}$ is then calculated by:

$$p^{t}=\frac{\sum_{i=1}^{M}p^{partial}_{i}+(\alpha M+\beta N)\cdot p^{whole}}{(\alpha+1)M+\beta N},$$

where $\alpha$ is the parameter to balance the weight of ‘ $partial$ ’ models and the ‘ $whole$ ’ model; $\beta$ is the parameter to adjust the weight of ‘ $whole$ ’ models based on the number of top neighbors to ensure at least one $p^{whole}$ will be used in case there is no ‘ $partial$ ’ model.
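As a worked illustration of this weighting, the sketch below combines model outputs for one test instance. The inputs (a decision table `D`, neighbor indices, and per-model predictions on $x_{t}$ with the whole model last) and the function name are hypothetical, and the nearest-neighbor search itself is assumed to have been done already.

```python
def ice_combine(D, neighbors, preds, alpha=1.0, beta=1.0):
    """Combine model predictions on one test instance.

    D: decision table rows for training instances (last column = whole model).
    neighbors: indices of the N nearest training instances.
    preds: predicted class probability of every model on the test instance.
    """
    L = len(preds)
    N = len(neighbors)
    # O^nb as a multiset: a model is repeated once per neighbor that selects it.
    selected = [j for k in neighbors for j in range(L - 1) if D[k][j] == 1]
    M = len(selected)
    numerator = sum(preds[j] for j in selected) + (alpha * M + beta * N) * preds[L - 1]
    # Note: the denominator is zero if M == 0 and beta == 0.
    return numerator / ((alpha + 1) * M + beta * N)

D = [[1, 0, 1],
     [1, 1, 1],
     [0, 0, 1]]
preds = [0.9, 0.6, 0.5]  # two partial models, then the whole model
p_mixed = ice_combine(D, [0, 1, 2], preds)   # M = 3, N = 3
p_fallback = ice_combine(D, [2], preds)      # M = 0, N = 1
```

In the first call the numerator is $0.9+0.9+0.6+(3+3)\cdot 0.5=5.4$ and the denominator is $2\cdot 3+3=9$, giving 0.6. In the second call no partial model is selected, so the formula reduces to $p^{whole}=0.5$, which is the fallback role of $\beta$ described above.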

In our experiments, $\alpha$ and $\beta$ are both set to 1 and $N$ is set to 5, except where we vary them to analyze the contribution of different components and the robustness of our algorithm’s performance.

II-C Relationship with Existing MCS Methods

ICE differs significantly from most existing ensemble methods in both model generation and model combination. Popular ensemble methods such as Bagging and Random Forest generate submodels using random subsets of data, and combine them using voting. In order for these methods to work effectively, a large number of submodels is needed to reduce overall bias. In contrast, ICE generates submodels to deliberately increase model diversity by clustering training instances. A carefully designed model-instance association algorithm helps identify the best ensemble for individual instances at prediction time. On the other hand, boosting generates submodels that focus on different groups of training instances, where the grouping of instances is done implicitly by iterative re-weighting and therefore lacks a global view of the instance space. In addition, since there is no model selection at prediction time, boosting tends to overfit in the presence of noisy training instances.

Mixture of experts is a class of neural network models that attempt to simultaneously learn multiple submodels as well as a gating function that assigns each instance to one or more submodels [14, 15]. Following a similar idea, several methods use clustering as a preprocessing step for classification [7, 8]. These algorithms force each instance to be in a disjoint cluster, which reduces the number of instances available to each submodel at training time. In addition, prediction is done only by cluster-specific models, so the cost of incorrect model selection is high. Empirical results presented in the original papers show mixed performance when compared to other MCS algorithms [14, 7, 8].

Finally, a series of methods have been developed recently under the common name ‘dynamic model selection’ [16, 17, 18, 19, 20, 10, 21]. These approaches take an ensemble of base classifiers (e.g., from bagging), then attempt to learn a high-level classification model using, for example, instance-instance similarities and model-model correlations as input features. While conceptually appealing, these methods tend to overfit and perform poorly when training data is limited. In our opinion, the marriage between random model generation and learning-based model combination is a poor choice, since the relatively small number of random models (compared to the possible number of instance combinations) does not guarantee that any submodel is predictably better than a simple average of all submodels.

III Results and Discussion

III-A Data and Experimental Setup

Characteristics of the 49 benchmark datasets are shown in Figure 2. The datasets are collected from the UCI machine learning repository and Kaggle Datasets for binary classification, with the number of instances between 100 and 3,000, the number of features between 3 and 1,500, and the percentage of the majority class ranging from 50% to 77%. A total of 42 datasets from UCI and 23 datasets from Kaggle meet the criteria (18 of which appear in both repositories). In addition, we add two cancer-related datasets - breast-cancer-nki [22] and breast-cancer-wang [23]. The data preprocessing mainly follows [24], which includes a $Z$-score normalization. For nine datasets with nominal features we use two different methods to handle the nominal features: (i) removing nominal features (denoted with suffix ‘-1’ in Figure 2), (ii) using One-Hot encoding (denoted with suffix ‘-2’ in Figure 2). Since all features in dataset ‘tic-tac-toe’ are nominal, this dataset only has the One-Hot encoding version. Data and source code are available at https://github.com/ds-utilities/ICE.

Performance of each classification method is evaluated by 10-fold cross-validation and measured by AUC. To facilitate a simple and fair evaluation, we use common parameter values for ICE on all datasets. The number of overlapping clusters, $L$, is set to 100, which, while not ideal for all datasets, makes evaluation easier. The advantage scores for the ‘ $whole$ ’ model and ‘ $local$ ’ models are set to $w=0.4$ and $s=0.5$, respectively; this reflects the empirical observation that $local$ models usually have better performance than the other two types of models, and that there are many $remote$ models, so a higher cutoff score is needed for a $remote$ model to be associated with an instance. In the prediction stage, the number of top neighbors $N$ is set to 5; the parameters $\alpha$ and $\beta$ are both set to 1 for an overall balance between ‘ $partial$ ’ and ‘ $whole$ ’ models in the final weighting of the prediction. The base model in the evaluation is a linear SVM with the regularization parameter $C=1$ for ICE and for the comparison methods Bagging and AdaBoost. Bagging and AdaBoost use 100 bags and 100 iterations, respectively. It is worth noting that these parameters are chosen intuitively without extensive tuning. Parameter analysis results show that the performance of ICE is robust over a wide range of parameter values.

III-B Empirical Evidence Supporting Cluster-Based Ensemble Classification

To verify our assumption that, for each testing instance, some subset of training instances may provide a better classification model than the whole set of training instances, we perform a simple experiment as follows. First, each dataset is clustered into three disjoint clusters using k-means. We denote the clusters as cluster-a, b and c, in order of decreasing cluster size. Then, the instances in each cluster are used for cross-testing: we compare the prediction AUC for each cluster using instances from cluster a, b, c or the whole dataset, respectively, as training data. We use the notation a-b to denote the situation where the model trained on cluster a is used to make predictions on cluster b instances.

To have a fair evaluation, when using a larger cluster to predict a smaller cluster, we randomly select the same number of instances from the larger cluster as the size of the smaller cluster to be the training data; when using a smaller cluster to predict a larger cluster, we use all instances in the smaller cluster and randomly select some instances from the larger cluster (making sure that they are not in the testing fold) to be the training instances, such that the total number of training instances is the same as the number of instances in the larger cluster.

As Figure 3 shows, with only three disjoint clusters, in more than 80% of the datasets at least one of the $local$ models can outperform the $whole$ model (Figure 3a and b, columns a-a, b-b and c-c). Interestingly, while in general the $remote$ models do not perform well, some of them have the largest performance gain compared to the $whole$ model (columns a-c and b-c). Collectively, this experiment shows the potential benefit of using a cluster of instances to improve prediction accuracy. On the other hand, the results also signify the importance of predicting, for each test instance, whether $partial$ models should be used, and which ones.

III-C ICE Outperforms Existing MCS Algorithms

Figure 4 shows that ICE outperforms the corresponding Bagging classifier on most datasets, and suffers only minor performance loss on a few datasets. Notably, ICE uses fewer than 100 base models - on average 45 models per prediction. ICE may still have room for improvement on the datasets where it fails, through parameter tuning and improved clustering methods. Understandably, ICE tends to have less performance gain on datasets with fewer instances, such as datasets 1 to 6, since ICE needs richer instance information for a meaningful clustering. From another perspective, ICE has an advantage on datasets with more instances and a more complex instance structure.

Table 1 shows the complete AUCs of three versions of ICE (with SVM, Bagging and AdaBoost as the base model) on the 49 benchmark datasets compared to multiple MCS methods, including Bagging, AdaBoost, and seven dynamic model selection approaches. META-DES [16] has two versions in this evaluation, using Perceptron (the base classifier choice of the original META-DES paper) and Bagging (comparable with Bagging and ICE-Bagging), respectively. The base classifier is Bagging for the other six dynamic model selection methods - KNORA-U [17], KNORA-E [17], DES-PRC [18, 19], OLA [20], MCB [10] and A Priori [21] - which is the suggested setting, plus SVM to make them comparable with the other methods. We use the suggested parameters for the dynamic model selection approaches [25].

As shown, all three versions of ICE have better performance than the other methods. The performance gain of ICE over Bagging can be attributed to the use of specifically generated models for subproblems and to the individualized model association and selection step. Comparing AdaBoost to ICE, both attempt to produce subdomain-specific classifiers; however, AdaBoost always uses the same ensemble of all submodels for all instances, which reduces the potential performance gain provided by the subdomain-specific models. Therefore, ICE-AdaBoost and even ICE-SVM perform better than AdaBoost in general. Moreover, ICE outperforms the seven dynamic selection methods. Each of the dynamic selection methods makes a unique contribution to model selection or integration. However, none of them focuses on deliberately generating models for specific subproblems, as the fuzzy clustering used by ICE does. In addition, the instance-model association of ICE can utilize all training instances, in contrast to dynamic selection methods such as META-DES, which separate training data into meta-learning and dynamic selection datasets, leading to more data loss and weaker base classifiers. As discussed earlier in Section II-C, learning the best combination of multiple randomly generated models can be a daunting task when the amount of training data is limited.

III-D Randomized Control Analysis Reveals The Effectiveness of Different Components of ICE

To understand the impact of the three components of ICE (C1: fuzzy clustering based model generation, C2: instance-model association, and C3: KNN-based model selection), we perform a randomized control experiment, where one or more of the components is replaced with a comparable, randomized procedure. To randomize C1, the fuzzy clustering is replaced by bootstrapping instances, where the bags are made the same size as the fuzzy clusters, resulting in a slightly modified version of Bagging. To randomize C2, the decision table is shuffled row-wise, destroying the association of models to instances. Finally, to randomize C3, KNN is replaced with random selection of instances. Note that randomizing C2 or C3 (or both) is expected to have a similar impact on the algorithm, which will essentially perform random model selection (and in most cases will choose many more models than the real ICE due to the independence of different rows of the randomized decision table).
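The three randomized controls can be sketched as below. This is an illustrative reading, not the authors' code: "shuffled row-wise" is interpreted as permuting which row of the decision table belongs to which instance, the whole-set cluster is left untouched when randomizing C1, and all function names are hypothetical.

```python
import random

def randomize_c1(clusters, n_instances, rng):
    """Replace each partial cluster by a bootstrap sample of the same size."""
    bags = [[rng.randrange(n_instances) for _ in c] for c in clusters[:-1]]
    return bags + [clusters[-1]]  # keep the whole-set cluster as is (assumption)

def randomize_c2(D, rng):
    """Permute the rows of the decision table across instances."""
    rows = [row[:] for row in D]
    rng.shuffle(rows)
    return rows

def randomize_c3(n_instances, n_neighbors, rng):
    """Pick 'neighbors' uniformly at random instead of by KNN."""
    return rng.sample(range(n_instances), n_neighbors)

rng = random.Random(0)
clusters = [[0, 1, 2], [2, 3], [0, 1, 2, 3, 4]]
D = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 1], [1, 0, 1]]
bags = randomize_c1(clusters, 5, rng)
D_shuffled = randomize_c2(D, rng)
fake_neighbors = randomize_c3(5, 3, rng)
```

Each control keeps the quantity it replaces comparable in size (bag sizes, number of table rows, number of neighbors), so that any performance difference can be attributed to the lost structure rather than to a change in ensemble size.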

Figure 5 shows the performance of ICE with different components randomized. Here, in order to show the effectiveness of each component of ICE, the parameters $\alpha$ and $\beta$ are set to 0, effectively eliminating the ‘ $whole$ ’ model. Not surprisingly, when both the model generation and model selection components of ICE are randomized (columns 1-3 in Figure 5a), its performance becomes similar to that of Bagging. On the other hand, when only one component is randomized (columns 4-7), ICE can still perform better than standard Bagging, although it is not as effective as the complete ICE algorithm (column 8), indicating that both components of ICE play a role in effective learning.

Interestingly, with only C1 randomized, our algorithm is conceptually similar to dynamic model selection [16], except that we replace the learning-based model selection with simple KNN-based model selection. The fact that this version of ICE still outperforms dynamic model selection suggests that, with limited training data, KNN-based model selection can have more robust performance than learning-based model selection. In addition, when C2 or C3 (or both) are randomized but C1 is not (columns 5-7), our algorithm is conceptually similar to bagging, except that the models in the ensemble are based on clusters of instances instead of random selections of instances. As shown, this version of ICE has a significant performance gain over Bagging, suggesting that, at least in these datasets, clustering-based model generation, which implicitly diversifies the models, can be better than randomized model generation.

III-E Performance of ICE is Robust in a Wide Range of Parameter Space

Figure 6 shows the results of ICE using a wide range of parameters - $N$ : number of neighbors per testing instances in prediction; $w$ : the weight advantage of the base whole model in model-instance association; $s$ : the weight advantage of the self-model in model-instance association. In this analysis, the parameter $\alpha$ and $\beta$ are both set to 1 to balance $whole$ and $partial$ models.

Figure 6a shows that the number of nearest neighbors used in model selection has only a slight impact on the average AUC gain across all 49 datasets. The recommended setting of $N$ is 5 to 10 for a balance between running speed and accuracy. ICE works best when there are strong patterns in the dataset. If ICE does not have a significant gain over Random Forest (RF) on a certain dataset, a larger $N$ will make ICE more stable and closer to bagging. ICE still has considerable room for improvement on specific datasets through a more suitable fuzzy clustering algorithm, which is part of our future work.

Figure 6b and Figure 6c show the robust performance of ICE with respect to the parameters $w$ and $s$. A general guideline is to set $s$ slightly larger than $w$, such as $s=0.5$, $w=0.4$. The parameters $\alpha$ and $\beta$ are quite simple to choose. Setting both $\alpha$ and $\beta$ to 1 leads to a decent result in most cases; setting both to 0 is worth trying if there are strong clusters within the dataset, where the extremely localized classifiers may have an advantage over the default choice of $\alpha=\beta=1$.

In addition, it is worth noting that the parameters used in the experimental setup have not been tuned for individual datasets in this study. Tuning the model on each dataset could potentially improve performance further.

III-F ICE Significantly Improves Random Forest Performance

We further perform an extreme comparison between ICE (using Random Forest with 100 trees as the base classifier) and Random Forest with 10,000 trees. Random Forest (RF) is well known for its stable, high performance and almost tuning-free design, and is well positioned to be a benchmark classifier. As shown in Figure 7, ICE significantly improves the performance of Random Forest: ICE wins or ties against RF on 36 out of 49 datasets (73%) and has a minor performance loss on the other 13. The $t$-test $p$-value of the gain is 0.018, which is significant ($p < 0.02$). There are 7 datasets (highlighted in Figure 7) with an AUC gain over 8% (among these, ICE has an AUC gain over 13.4% on 4 datasets), while no dataset has an AUC loss over 3%. Note that ICE uses on average only 47 models per prediction, far fewer than the 10,000 trees used by the RF classifier. Moreover, RF quickly reaches its performance limit as the number of trees grows, while ICE has much more room for improvement as the number of submodels increases. The performance of ICE can be further improved by increasing the number of fuzzy clusters (submodels) or by using more suitable clustering methods.
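The per-dataset comparison above can be sketched as follows. This is an illustration only: the text does not say which $t$-test was used, so a paired test on per-dataset AUC differences is an assumption, the function name `summarize_gains` is hypothetical, and the four AUC values are made-up placeholders rather than results from the paper.

```python
import math
from statistics import mean, stdev

def summarize_gains(auc_a, auc_b):
    """Count wins-or-ties and losses of method A over B, and compute a paired t statistic."""
    gains = [a - b for a, b in zip(auc_a, auc_b)]
    wins_or_ties = sum(g >= 0 for g in gains)
    losses = len(gains) - wins_or_ties
    # Paired t statistic: mean gain divided by its standard error.
    t_stat = mean(gains) / (stdev(gains) / math.sqrt(len(gains)))
    return wins_or_ties, losses, t_stat

# Placeholder AUC values for four imaginary datasets (not from the paper).
auc_ice = [0.90, 0.85, 0.80, 0.70]
auc_rf = [0.88, 0.85, 0.70, 0.72]

wins_or_ties, losses, t_stat = summarize_gains(auc_ice, auc_rf)
```

Under the paired-test assumption, the $p$-value would then be read from a $t$ distribution with $n-1$ degrees of freedom, where $n$ is the number of datasets.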

The performance gain of ICE over RF can be attributed to the use of models generated specifically for subproblems and to the individualized model association and selection step. Interestingly, the AUC gain of ICE is correlated with the result from Figure 3a: the 7 datasets with the largest ICE gain in Figure 7 have on average 4.3 ‘$partial$’ models that outperform the ‘$whole$’ model, while this statistic is only 2.6 for the other datasets; the average AUC gain by ‘$partial$’ models of ICE on these 7 datasets is 0.077, while it is 0.057 for the other datasets. This result not only further validates our intuition of using ‘$partial$’ models to improve classification performance, but also suggests that the performance of ICE is partially predictable from dataset characteristics, which is a very important feature in practice.

III-G Classification Improved by Accurately Predicting ‘Hard’ Instances with ‘ $partial$ ’ Models

ICE has a stable AUC gain on most datasets over a large range of parameter variation, and one of the datasets with the most dramatic improvement is 15-breast-cancer-1. As shown in Figure 8a, as the parameter $N$ increases, the average number of models per instance also increases and the performance of ICE continues to improve, reaching a plateau at $N\geq 25$. In addition, an analysis of the models used by each test instance shows an interesting bimodal distribution: most of the test instances (262 out of 286 cases) use fewer than 20 models (mostly $local$ models); in contrast, a few instances (24 cases) use more than 40 unique models (including both $local$ and $remote$ models) (Figure 8b). These are presumably the more difficult instances, which are hard to cluster and/or classify.

Comparing the performance on these two groups of instances, we can see that ICE has a much lower prediction error than the ‘$whole$’ model on instances for which ICE uses $>$ 30 models (Figure 8c). The $t$-test $p$-value of the error differences between the ‘$whole$’ model prediction and the ICE prediction on instances with $>$ 30 models (24 cases) is significant ($8.98\times 10^{-6}$). This result demonstrates that different instances should be treated differently on this dataset, and the ICE algorithm shows a potential way of separating and treating these different instances.

III-H AUC Gain of ICE has a Strong Correlation with the Data-Decision Table Similarity

In this work, instances are clustered based on their similarities in the feature space. However, it is possible that this clustering may not be optimal in revealing model heterogeneity. A different view may be obtained by analyzing the instance-instance similarities in the model space. Therefore, we use the decision table, which describes the prediction performance of each model on each instance, to measure instance-instance similarity, and inspect whether the consistency between these two types of similarity measures can be predictive of the performance of ICE.

Indeed, as shown in Figure 9, there exists a strong positive correlation (Pearson correlation coefficient = 0.425) between AUC gain and feature-model consistency, where the consistency is defined as the Pearson correlation between the instance-instance similarities measured in the feature space and the instance-instance similarities measured using the decision table entries. This result indicates the potential of improving our current work through feature selection and a better clustering method on data $X$. Our intuition is that some features are more related to a classification task than others, and that clustering should use these features rather than all the features. This also explains why the AUC gain on dataset 15-breast-cancer-1 (3 features, AUC gain = 0.126) is much larger than the AUC gain on dataset 16-breast-cancer-2 (50 features, AUC gain = 0.024). The three features of dataset 15-breast-cancer-1 are ‘tumor-size’, ‘left or right breast’ and ‘if irradiate’; all the non-binary nominal features in the original breast cancer dataset from [24] have been removed, while dataset 16-breast-cancer-2 keeps all the other nominal features through One-Hot encoding. It is reasonable to imagine that the clustering on dataset 16 is influenced by some over-complicated features that are irrelevant to the classification task; therefore, the models built on those clusters are not optimized for the classification task. A potential future improvement is to cluster instances based on the output values from different models instead of on the feature values, or to use both in an iterative manner.
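The feature-model consistency defined above can be computed along the following lines. This is a sketch under stated assumptions: the similarity measure is not given in this section, so negative Euclidean distance is used as a stand-in for both the feature space and the decision table, and the function names and toy arrays are hypothetical.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def pairwise_similarity(rows):
    """Similarity (assumed: negative Euclidean distance) for every unordered pair of rows."""
    pairs = combinations(range(len(rows)), 2)
    return [-math.dist(rows[i], rows[j]) for i, j in pairs]

def feature_model_consistency(X, D):
    """Correlate instance-instance similarities in feature space X and decision table D."""
    return pearson(pairwise_similarity(X), pairwise_similarity(D))

# Toy feature matrix: two tight groups of instances.
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
# Toy decision tables (rows: instances, columns: models; 1 = model does well).
D_aligned = [(1, 0), (1, 0), (0, 1), (0, 1)]     # model behavior follows the groups
D_misaligned = [(1, 0), (0, 1), (1, 0), (0, 1)]  # model behavior cuts across the groups

aligned = feature_model_consistency(X, D_aligned)
misaligned = feature_model_consistency(X, D_misaligned)
```

In the aligned toy case, instances that are close in feature space are also served well by the same models, so the consistency is high; in the misaligned case it turns negative, which is the situation in which feature-space clustering would be expected to help least.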

IV Conclusion

Based on the intuition that classifiers generated from different subdomains of training instances are needed in classification tasks, we proposed ICE, a novel multiple classifier generation and combination framework that generally increases the diversity among submodels and successfully associates the submodels with subdomains of instances. Evaluation results on 49 benchmarks show that our model has a stable improvement on a significant proportion of datasets over multiple existing MCS methods. A detailed component analysis shows that the different components of our algorithm work in coordination to achieve its performance. We believe that ICE provides a novel way of utilizing subdomain models to improve classification.

Acknowledgment

This research was supported in part by grants from the National Science Foundation (award numbers IIS-1218201 and ABI-1565076) and the National Institutes of Health (award numbers G12MD007591 and U54CA217297).

Bibliography


[1] L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence Review, vol. 33, no. 1-2, pp. 1–39, 2010.
[2] L. I. Kuncheva, Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.
[3] N. C. Oza and K. Tumer, “Classifier ensembles: Select real-world applications,” Information Fusion, vol. 9, no. 1, pp. 4–20, 2008.
[4] L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.
[5] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[6] Y. Freund and R. E. Schapire, “A desicion-theoretic generalization of on-line learning and an application to boosting,” in European conference on computational learning theory. Springer, 1995, pp. 23–37.
[7] R. Vilalta, M.-K. Achari, and C. F. Eick, “Class decomposition via clustering: a new framework for low-variance classifiers,” in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE, 2003, pp. 673–676.
[8] L. I. Kuncheva, “Clustering-and-selection model for classifier combination,” in Knowledge-Based Intelligent Engineering Systems and Allied Technologies, 2000. Proceedings. Fourth International Conference on, vol. 1. IEEE, 2000, pp. 185–188.
