TL;DR
This paper investigates various strategies for combining linear classifiers using score functions, comparing their effectiveness through experiments with heterogeneous ensembles and multiple quality metrics.
Contribution
It introduces and evaluates four combination strategies for linear classifiers based on score functions, providing insights into their relative performance.
Findings
Simple average and trimmed average are the most effective combination strategies.
The proposed methods outperform majority voting and model averaging in several quality criteria.
Experimental results validate the effectiveness of geometrical combination strategies.
Abstract
In this work, we address the problem of combining linear classifiers using their score functions. The value of the score function depends on the distance from the decision boundary. Two score functions were tested and four different combination strategies were investigated. In the experimental study, the proposed approach was applied to a heterogeneous ensemble and compared to two reference methods: majority voting and model averaging. The comparison was made in terms of seven different quality criteria. The results show that the strategies based on the simple average and the trimmed average are the best of the geometrical combination strategies.
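| Name | Instances | Features | Classes | Imbalance ratio | Name | Instances | Features | Classes | Imbalance ratio | Name | Instances | Features | Classes | Imbalance ratio |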
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| appendicitis | 106 | 7 | 2 | 2.52 | housevotes | 435 | 16 | 2 | 1.29 | shuttle | 57999 | 9 | 7 | 1326.03 |
| australian | 690 | 14 | 2 | 1.12 | ionosphere | 351 | 34 | 2 | 1.39 | sonar | 208 | 60 | 2 | 1.07 |
| balance | 625 | 4 | 3 | 2.63 | iris | 150 | 4 | 3 | 1.00 | spambase | 4597 | 57 | 2 | 1.27 |
| banana2D | 2000 | 2 | 2 | 1.00 | led7digit | 500 | 7 | 10 | 1.16 | spectfheart | 267 | 44 | 2 | 2.43 |
| bands | 539 | 19 | 2 | 1.19 | lin1 | 1000 | 2 | 2 | 1.01 | spirals1 | 2000 | 2 | 2 | 1.00 |
| Breast Tissue | 105 | 9 | 6 | 1.29 | lin2 | 1000 | 2 | 2 | 1.83 | spirals2 | 2000 | 2 | 2 | 1.00 |
| check2D | 800 | 2 | 2 | 1.00 | lin3 | 1000 | 2 | 2 | 2.26 | spirals3 | 2000 | 2 | 2 | 1.00 |
| cleveland | 303 | 13 | 5 | 5.17 | magic | 19020 | 10 | 2 | 1.42 | texture | 5500 | 40 | 11 | 1.00 |
| coil2000 | 9822 | 85 | 2 | 8.38 | mfdig fac | 2000 | 216 | 10 | 1.00 | thyroid | 7200 | 21 | 3 | 19.76 |
| dermatology | 366 | 34 | 6 | 2.41 | movement libras | 360 | 90 | 15 | 1.00 | titanic | 2201 | 3 | 2 | 1.55 |
| diabetes | 768 | 8 | 2 | 1.43 | newthyroid | 215 | 5 | 3 | 3.43 | twonorm | 7400 | 20 | 2 | 1.00 |
| Faults | 1940 | 27 | 7 | 4.83 | optdigits | 5620 | 62 | 10 | 1.02 | ULC | 675 | 146 | 9 | 2.17 |
| gauss2DV | 800 | 2 | 2 | 1.00 | page-blocks | 5472 | 10 | 5 | 58.12 | vehicle | 846 | 18 | 4 | 1.03 |
| gauss2D | 4000 | 2 | 2 | 1.00 | penbased | 10992 | 16 | 10 | 1.04 | Vertebral Column | 310 | 6 | 3 | 1.67 |
| gaussSand2 | 600 | 2 | 2 | 1.50 | phoneme | 5404 | 5 | 2 | 1.70 | wdbc | 569 | 30 | 2 | 1.34 |
| gaussSand | 600 | 2 | 2 | 1.50 | pima | 767 | 8 | 2 | 1.44 | wine | 178 | 13 | 3 | 1.23 |
| glass | 214 | 9 | 6 | 3.91 | ring2D | 4000 | 2 | 2 | 1.00 | winequality-red | 1599 | 11 | 6 | 20.71 |
| haberman | 306 | 3 | 2 | 1.89 | ring | 7400 | 20 | 2 | 1.01 | winequality-white | 4898 | 11 | 7 | 82.94 |
| halfRings1 | 400 | 2 | 2 | 1.00 | saheart | 462 | 9 | 2 | 1.44 | wisconsin | 699 | 9 | 2 | 1.45 |
| halfRings2 | 600 | 2 | 2 | 1.50 | satimage | 6435 | 36 | 6 | 1.66 | yeast | 1484 | 8 | 10 | 17.08 |
| hepatitis | 155 | 19 | 2 | 2.42 | Seeds | 210 | 7 | 3 | 1.00 | | | | | |
| HillVall | 1212 | 100 | 2 | 1.01 | segment | 2310 | 19 | 7 | 1.00 | | | | | |

Table 2: Friedman test p-values (Frd), average ranks (Rnk), and pairwise Wilcoxon p-values for each quality criterion (Nam). Columns 1 to 7 correspond to the compared methods in the order listed in Section 3; the row labelled k holds the p-values for method k against each higher-numbered method.

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Nam | Zero-One | | | | | | |
| Frd | 5.729e-14 | | | | | | |
| Rnk | 2.98 | 3.78 | 3.36 | 3.73 | 5.72 | 3.56 | 4.87 |
| 1 | | .007 | .091 | .002 | .000 | .161 | .000 |
| 2 | | | .968 | .968 | .000 | .968 | .007 |
| 3 | | | | .080 | .000 | .968 | .000 |
| 4 | | | | | .000 | .846 | .000 |
| 5 | | | | | | .000 | .000 |
| 6 | | | | | | | .000 |
| Nam | MaFDR | | | | | | |
| Frd | 2.873e-04 | | | | | | |
| Rnk | 3.76 | 4.45 | 3.41 | 3.93 | 4.75 | 3.31 | 4.39 |
| 1 | | .016 | .295 | .969 | .155 | .279 | .673 |
| 2 | | | .001 | .025 | .878 | .002 | .295 |
| 3 | | | | .056 | .013 | .878 | .025 |
| 4 | | | | | .028 | .056 | .155 |
| 5 | | | | | | .004 | .295 |
| 6 | | | | | | | .003 |
| Nam | MaFNR | | | | | | |
| Frd | 1.791e-08 | | | | | | |
| Rnk | 4.27 | 5.32 | 3.32 | 3.58 | 4.00 | 3.09 | 4.42 |
| 1 | | .000 | .003 | .505 | 1.00 | .000 | 1.00 |
| 2 | | | .000 | .000 | .018 | .000 | .002 |
| 3 | | | | .049 | .139 | 1.00 | .008 |
| 4 | | | | | .601 | .049 | .016 |
| 5 | | | | | | .139 | 1.00 |
| 6 | | | | | | | .001 |
| Nam | MaF1 | | | | | | |
| Frd | 2.641e-09 | | | | | | |
| Rnk | 3.96 | 5.10 | 3.23 | 3.59 | 4.81 | 2.96 | 4.35 |
| 1 | | .000 | .017 | .548 | .117 | .000 | .340 |
| 2 | | | .000 | .000 | .315 | .000 | .017 |
| 3 | | | | .017 | .002 | .454 | .001 |
| 4 | | | | | .007 | .014 | .011 |
| 5 | | | | | | .000 | .185 |
| 6 | | | | | | | .000 |
| Nam | MiFDR | | | | | | |
| Frd | 5.729e-14 | | | | | | |
| Rnk | 2.98 | 3.78 | 3.36 | 3.73 | 5.72 | 3.56 | 4.87 |
| 1 | | .007 | .091 | .002 | .000 | .161 | .000 |
| 2 | | | .968 | .968 | .000 | .968 | .007 |
| 3 | | | | .080 | .000 | .968 | .000 |
| 4 | | | | | .000 | .846 | .000 |
| 5 | | | | | | .000 | .000 |
| 6 | | | | | | | .000 |
| Nam | MiFNR | | | | | | |
| Frd | 5.729e-14 | | | | | | |
| Rnk | 2.98 | 3.78 | 3.36 | 3.73 | 5.72 | 3.56 | 4.87 |
| 1 | | .007 | .091 | .002 | .000 | .161 | .000 |
| 2 | | | .968 | .968 | .000 | .968 | .007 |
| 3 | | | | .080 | .000 | .968 | .000 |
| 4 | | | | | .000 | .846 | .000 |
| 5 | | | | | | .000 | .000 |
| 6 | | | | | | | .000 |
| Nam | MiF1 | | | | | | |
| Frd | 5.729e-14 | | | | | | |
| Rnk | 2.98 | 3.78 | 3.36 | 3.73 | 5.72 | 3.56 | 4.87 |
| 1 | | .007 | .091 | .002 | .000 | .161 | .000 |
| 2 | | | .968 | .968 | .000 | .968 | .007 |
| 3 | | | | .080 | .000 | .968 | .000 |
| 4 | | | | | .000 | .846 | .000 |
| 5 | | | | | | .000 | .000 |
| 6 | | | | | | | .000 |

Department of Systems and Computer Networks, Wroclaw University of Science and Technology,
Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
Combination of linear classifiers using score function – analysis of possible combination strategies
Pawel Trajdos
Robert Burduk
Keywords:
binary classifiers, linear classifiers, geometrical space, potential function
1 Introduction
The combination of multiple base classifiers has been an important issue in machine learning for about twenty years [8], [35]. Ensembles of classifiers (EoC), also called multiple classifier systems (MCSs) [5], [21], [11], [26], [34], are popular in supervised classification because single classifiers are often unstable (small changes in the input data may result in very different decision boundaries) and because an ensemble is often more accurate than any of its base classifiers.
The task of constructing MCSs can be generally divided into three steps: generation, selection and integration [2]. In the first step a set of base classifiers is trained using manipulation of the training patterns, manipulation of the training parameters or manipulation of the feature space.
The second phase of building EoCs is related to the choice of a set or one classifier from the whole available pool of base classifiers. It is popular to use the diversity measure to select one classifier or a subset of all base classifiers. In the literature, there are many approaches to the selection phase of building EoCs [17], [3], [28], [27].
The integration process is the last stage of constructing EoCs and it is widely discussed in the pattern recognition literature [24], [32]. Generally, supervised learning methods produce a classifier whose output is represented as a score function. This function is then mapped to an a posteriori probability, a rank-level function, or directly to a class label. Depending on the type of mapping, many methods for integrating base classifiers can be distinguished [19], [25], [31].
In this paper we propose a classifier integration process that uses score functions without their further transformation. We examine two forms of the score function, called the potential function, and investigate four different combination strategies.
The remainder of this paper is organized as follows. Section 2 presents the proposed method of EoC integration using two types of the potential function. The experimental evaluation is presented in Section 3. The discussion and conclusions from the experiments are presented in Section 4.
2 Proposed Method
In this section, the proposed approach is explained. Additionally, this section introduces the notation used in this paper.
2.1 Linear Binary Classifiers
In this paper, it is assumed that the input space is a Euclidean space . Each object from the input space belongs to one of two available classes, so the output space is: . It is assumed that there exists an unknown mapping that assigns each input space coordinates into a proper class. A classifier is a function that is designed to provide an approximation of the unknown mapping . A linear classifier makes its decision according to the following rule:
[TABLE]
where is the so called discriminant function of the classifier [19], is a unit normal vector of the decision hyperplane (), is the distance from the hyperplane to the origin and is a dot product defined as follows:
[TABLE]
In this paper, we use a norm of the vector defined using the dot product:
[TABLE]
When the normal vector of the plane is a unit vector, the absolute value of the discriminant function equals the distance from the decision hyperplane to the point. The sign of the discriminant function depends on the side of the plane on which the instance lies.
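The relation between the discriminant function and the distance to the hyperplane can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the class labels -1 and +1, and the sign convention of the offset are assumptions made here.

```python
import math


def dot(a, b):
    """Standard dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    """Norm induced by the dot product."""
    return math.sqrt(dot(a, a))


def discriminant(x, normal, offset):
    """Signed distance of x from the hyperplane <normal, x> + offset = 0.

    Dividing by the length of `normal` is equivalent to using a unit
    normal vector, so the absolute value is the Euclidean distance.
    """
    return (dot(normal, x) + offset) / norm(normal)


def classify(x, normal, offset):
    """Crisp decision: +1 on the non-negative side, -1 otherwise (assumed labels)."""
    return 1 if discriminant(x, normal, offset) >= 0 else -1
```

For the hyperplane `3*x0 + 4*x1 - 5 = 0`, the point `(3, 4)` lies at distance 4 on the positive side and the origin lies at distance 1 on the negative side.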
Now, let us define an ensemble classifier:
[TABLE]
that is a set of classifiers that work together in order to produce a more robust result [19]. In this paper, it is assumed that only linear, binary classifiers are employed. There are multiple strategies to combine the classifiers constituting the ensemble. The simplest strategy to combine the outcomes of multiple classifiers is to apply the majority voting scheme [19]:
[TABLE]
where is the value of the discriminant function provided by the classifier for point . However, this simple yet effective strategy completely ignores the distance of the instance from the decision planes.
Another strategy is model averaging [29]. The output of the averaged model may be calculated by simply averaging the values of the discriminant functions:
[TABLE]
After combining the base classifiers, the final prediction of the ensemble is obtained according to the rule (1).
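The difference between the two reference methods can be illustrated with a short sketch. It assumes that each base classifier reports its discriminant value as a float, that classes are labelled -1 and +1, and that ties go to the positive class; these conventions are illustrative assumptions rather than details taken from the paper.

```python
def sign(value):
    """Crisp decision for a discriminant value; ties go to +1 (assumption)."""
    return 1 if value >= 0 else -1


def majority_vote(scores):
    """Combine by counting the crisp votes of the base classifiers."""
    return sign(sum(sign(s) for s in scores))


def model_average(scores):
    """Combine by averaging the raw discriminant values."""
    return sign(sum(scores) / len(scores))


# Two classifiers place the point slightly on the positive side,
# one places it far on the negative side.
example_scores = [0.1, 0.2, -3.0]
```

With `example_scores`, majority voting returns +1 because it ignores distances, while model averaging returns -1 because the single large negative value dominates the mean.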
2.2 The Proposed Method
In this paper, an approach similar to the softmax [19] normalization is proposed. Contrary to the softmax normalization, our goal is not to provide a probabilistic interpretation of the linear classifier but to provide a fusion technique that works in the geometrical space. The idea is to span a potential field around the decision plane. The potential field may be constructed by applying a transformation on the value of the discriminant function. The transformation must meet the following properties:
[TABLE]
Property (7) assures that the crisp decision based on the transformed value is the same as the decision based on the unmodified discriminant function. Property (8) bounds in interval . However, contrary to the softmax normalization the transformation does not have to be a sigmoid function. Property (9) assures that the potential is 0 at the surface of the decision plane. In this paper, the following transformation function is used:
[TABLE]
where is a coefficient that determines the position and steepness of the peak. The translation constant and the scaling factor guarantee that the maximum and minimum values are and respectively. The function is visualised in the figure 1.
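To make the three required properties concrete, the sketch below uses one function that satisfies them. The specific form `z * exp(-zeta * z**2 + 0.5) * sqrt(2 * zeta)` and the parameter name `zeta` are assumptions chosen for illustration; the paper's exact formula may differ.

```python
import math


def potential(z, zeta):
    """Example potential function of the discriminant value z.

    Assumed form: z * exp(-zeta * z^2 + 0.5) * sqrt(2 * zeta), zeta > 0.
    It keeps the sign of z, is zero at z = 0, and reaches its extrema
    +1 and -1 at z = +/- 1 / sqrt(2 * zeta).
    """
    if zeta <= 0:
        raise ValueError("zeta must be positive")
    return z * math.exp(-zeta * z * z + 0.5) * math.sqrt(2.0 * zeta)


def peak_position(zeta):
    """Distance from the decision plane at which the potential peaks."""
    return 1.0 / math.sqrt(2.0 * zeta)
```

Under this assumed form, a larger `zeta` moves the peak closer to the decision plane and makes it steeper, so instances far from the plane contribute less to the combined value.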
All models in the ensemble share the same shape coefficient . The shape coefficient is tuned in order to achieve the best quality of the entire ensemble.
After transforming the values of discriminant functions for the entire ensemble, there is a need to combine the outcomes to produce the final decision. In this paper, we analyze four different combination rules. The first one is a simple average of the transformed values of discriminant functions:
[TABLE]
The second one applies the trimmed mean:
[TABLE]
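The simple average and the trimmed mean of the transformed values can be sketched as below. The number of values trimmed from each end is not stated here, so the default of one value per end is an assumption, as are the function names.

```python
def simple_average(values):
    """Mean of the transformed discriminant values."""
    return sum(values) / len(values)


def trimmed_average(values, trim=1):
    """Mean after dropping the `trim` smallest and `trim` largest values.

    trim=1 is an assumed default, not a setting taken from the paper.
    """
    if 2 * trim >= len(values):
        raise ValueError("too few values to trim")
    kept = sorted(values)[trim:len(values) - trim]
    return sum(kept) / len(kept)


# Five transformed outputs, one of them an extreme negative value.
example_values = [-1.0, 0.1, 0.2, 0.3, 0.1]
```

For `example_values` the simple average is negative, whereas the trimmed average is positive because the extreme value is discarded before averaging.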
Before the remaining combination rules are defined, let us introduce subsets of negative and positive values of the transformed ensemble outcomes:
[TABLE]
Then, the remaining rules are as follows:
[TABLE]
where and are cardinality of set and the absolute value of respectively.
The proposed algorithm can handle only binary classification problems. However, any multi-class problem can be decomposed into multiple binary problems. In the experimental stage, the One-vs-One strategy was used [16]. This strategy builds a separate binary classifier for each pair of classes. In our method, the single pair-specific classifier is replaced by the ensemble classifier described above.
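A minimal sketch of One-vs-One prediction is given below. The voting scheme, the convention that a negative output selects the first class of the pair, and the tie-breaking rule are assumptions for illustration; each pairwise decision function stands in for a combined ensemble.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise):
    """Predict a class by letting every pair-specific classifier vote.

    `pairwise[(a, b)]` is a function of x; a negative output votes for a,
    a non-negative output votes for b (assumed convention).
    Ties are resolved in favour of the class listed first (assumption).
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        decision = pairwise[(a, b)](x)
        votes[a if decision < 0 else b] += 1
    return max(classes, key=lambda c: votes[c])


# Hypothetical one-dimensional problem with class centres at 0, 5 and 10;
# each pairwise rule is a threshold halfway between two centres.
example_classes = [0, 1, 2]
example_pairwise = {
    (0, 1): lambda x: x - 2.5,
    (0, 2): lambda x: x - 5.0,
    (1, 2): lambda x: x - 7.5,
}
```

With these thresholds, `one_vs_one_predict(6.0, example_classes, example_pairwise)` collects two votes for class 1 and one vote for class 2.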
3 Experimental Setup
In the experimental study, the proposed approach was used to combine classifiers in a heterogeneous ensemble of classifiers. The following base classifiers were employed:
- Fisher LDA [22]
- single-layer MLP classifier [12]
- nearest centroid (nearest prototype) [20, 18]
- SVM classifier with a linear kernel (no kernel) [4]
- logistic regression classifier [7]
The classifiers implemented in the WEKA framework [13] were used, with their parameters set to the defaults. Multi-class problems were handled using One-vs-One decomposition [16]. The experimental code was also implemented using the WEKA framework [13], and the source code of the algorithms is available online (https://github.com/ptrajdos/piecewiseLinearClassifiers/tree/master). The heterogeneous ensemble employs one copy of each of the above-mentioned base classifiers. Each classifier is learned using the entire dataset.
During the experimental evaluation the following combination methods were compared:
1. the ensemble combined using the majority voting approach,
2. the ensemble combined using the model averaging approach,
3. the ensemble combined using the rule described in (11),
4. the ensemble combined using the rule described in (15),
5. the ensemble combined using the rule described in (16),
6. the ensemble combined using the rule described in (17),
7. the ensemble combined using the rule described in (17).
The coefficient for transformation and was tuned using the grid search approach. The following set of parameter values were investigated:
[TABLE]
The parameter is chosen in such a way that it provides the maximum value of the macro-averaged criterion.
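The tuning step amounts to evaluating each candidate value and keeping the one with the highest quality. The sketch below shows only that selection logic; the candidate grid and the stand-in quality function are hypothetical and do not reproduce the values or the criterion used in the experiments.

```python
def tune_shape(candidates, quality):
    """Return the candidate with the highest quality, and that quality.

    `quality` maps a candidate coefficient to a score where higher is
    better, e.g. a macro-averaged criterion measured for the ensemble.
    """
    best = max(candidates, key=quality)
    return best, quality(best)


# Hypothetical grid and a stand-in quality function peaking at 2.0.
example_grid = [0.5, 1.0, 2.0, 4.0, 8.0]


def example_quality(coefficient):
    return -(coefficient - 2.0) ** 2
```

In practice the quality function would train or reuse the ensemble, apply the transformation with the given coefficient, and return the chosen criterion.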
To evaluate the proposed methods, the following classification-quality criteria are used [30]: zero-one loss (accuracy); macro-averaged FDR, FNR, and F1; and micro-averaged FDR, FNR, and F1.
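The macro- and micro-averaged criteria can be computed from per-class counts of true positives, false positives, and false negatives, as sketched below. The sketch uses the standard definitions FDR = FP / (TP + FP), FNR = FN / (TP + FN), and F1 = 2TP / (2TP + FP + FN); returning 0.0 for an empty denominator is an assumption made here.

```python
def confusion_counts(y_true, y_pred):
    """Per-class true positives, false positives and false negatives."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            counts[truth]["tp"] += 1
        else:
            counts[guess]["fp"] += 1
            counts[truth]["fn"] += 1
    return counts


def ratio(numerator, denominator):
    """Safe division; 0.0 for an empty denominator (assumption)."""
    return numerator / denominator if denominator else 0.0


def rates(tp, fp, fn):
    """FDR, FNR and F1 for one set of counts."""
    return {
        "FDR": ratio(fp, tp + fp),
        "FNR": ratio(fn, tp + fn),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def macro(counts):
    """Average the per-class rates, so every class weighs the same."""
    per_class = [rates(**c) for c in counts.values()]
    return {k: sum(r[k] for r in per_class) / len(per_class)
            for k in ("FDR", "FNR", "F1")}


def micro(counts):
    """Pool the counts over classes first, so every instance weighs the same."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return rates(tp, fp, fn)


example_true = [0, 0, 0, 0, 1, 1]
example_pred = [0, 0, 0, 1, 1, 0]
```

When every instance receives exactly one label, each error is one false positive and one false negative, so the micro-averaged F1 equals the accuracy and the micro-averaged FDR and FNR both equal the error rate. The macro averages respond more strongly to errors on small classes.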
Following the recommendations of [6] and [10], the statistical significance of the obtained results was assessed using a two-step procedure. The first step is to perform the Friedman test [9] for each quality criterion separately. Since multiple criteria were employed, the family-wise error rate (FWER) should be controlled [36]. To do so, the Bergmann-Hommel [1] procedure for controlling the FWER of the conducted Friedman tests was employed. When the Friedman test shows a significant difference within the group of classifiers, pairwise Wilcoxon signed-rank tests [33], [6] are performed. To control the FWER of the Wilcoxon testing procedure, the Bergmann-Hommel approach was employed [15]. For all the tests the significance level was set to .
Table 1 displays the collection of benchmark sets that were used during the experimental evaluation of the proposed algorithms. The table is divided into three column groups, each organized as follows. The first column contains the name of the dataset. The remaining ones contain the set-specific characteristics of the benchmark sets: the number of instances in the dataset; the dimensionality of the input space; the number of classes; and the average imbalance ratio.
The datasets come from the Keel repository (https://sci2s.ugr.es/keel/category.php?cat=clas) or were generated by us. The datasets are available online (https://github.com/ptrajdos/MLResults/blob/master/data/slDataFull.zip).
During the dataset-preprocessing stage, a few transformations on datasets were applied. That is, features are selected using the correlation-based approach [14]. Then, the PCA method was applied [23] and the percentage of variance was set to . The attributes were also scaled to fit the interval . Additionally, in order to ensure the dot product to be in the interval , vectors in each dataset were scaled using the factor . This normalization makes it easier to find proper .
4 Results and Discussion
To compare multiple algorithms on multiple benchmark sets, the average ranks approach [6] is used. In this approach, the winning algorithm achieves rank 1, the second achieves rank 2, and so on. In the case of ties, the ranks of the algorithms that achieve the same result are averaged. To visualise the average ranks, radar plots are employed. In the plots, the data is visualised in such a way that the lowest ranks are closer to the centre of the graph. The radar plots related to the experimental results are shown in Figure 2.
Due to the page limit, the full results are published online (https://github.com/ptrajdos/MLResults/blob/master/Boundaries/bounds_hetero_15.01.2019E4_m_R.zip).
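The average ranks procedure can be sketched as follows. The sketch assumes one row of scores per benchmark set and one column per algorithm, with higher scores being better; the function names are illustrative.

```python
def rank_row(scores, higher_is_better=True):
    """Rank the algorithms on one dataset; rank 1 is best, ties are averaged."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next algorithm has the same score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def average_ranks(table):
    """Average the per-dataset ranks of every algorithm."""
    rows = [rank_row(row) for row in table]
    return [sum(column) / len(rows) for column in zip(*rows)]


# Two hypothetical datasets, three algorithms.
example_table = [
    [0.9, 0.8, 0.9],
    [0.7, 0.8, 0.6],
]
```

On the first row of `example_table`, the two algorithms tied at 0.9 share ranks 1 and 2, so each receives 1.5.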
The numerical results are given in Table 2. The table is structured as follows. The first row contains names of the investigated algorithms. Then, the table is divided into seven sections – one section is related to a single evaluation criterion. The first row of each section is the name of the quality criterion investigated in the section. The second row shows the p-value of the Friedman test. The third one shows the average ranks achieved by algorithms. The following rows show p-values resulting from pairwise Wilcoxon test. The p-value which is equal to informs that the p-values are lower than and p-value is equal to informs that the value is higher than .
The analysis of the radar plot suggests that two groups of classification criteria can be distinguished. The first group contains micro-averaged criteria and the zero-one criterion, the second one contains macro-averaged criteria. Evaluation of the classifiers carried out with the use of criteria belonging to a specific group reveals different relationships between classifiers. These differences are a consequence of the properties of the quality criteria used. This means that the zero-one criterion and micro-averaged criteria give us information related to the classification quality for the majority classes. On the other hand, the macro-averaged criteria put more emphasis on classification quality for minority classes [30].
For the zero-one criterion and micro-averaged criteria, three main groups of classifiers can be seen. The first group contains and classifiers that perform significantly worse than the other analysed classifiers. What is more, classifier is significantly worse than for all quality criteria belonging to the investigated group. The second group contains only one classifier – . According to average ranks, this classifier is the best performing one for the investigated set of quality criteria. According to the statistical analysis, this classifier outperforms the remaining classifiers except for and . The third group consisted of classifiers , , , and . There are no significant differences between the classifiers within this group.
For macro-averaged measures, the situation changes significantly. First of all, it may be noticed that average ranks of reference methods ( and ) increase, whereas the average ranks of the proposed methods decrease. That is, the model-averaging classifier becomes the worst one except for according to macro-averaged and FNR criteria. The majority voting classifier also deteriorates significantly. Now it is comparable to , and classifiers. What is more, classifier is outperformed by and classifiers in terms of macro-averaged FNR and criteria. The reason for the above-mentioned deterioration of the reference methods is the fact that they are not tuned to perform better on minority classes, whereas the investigated methods were tuned to do so.
Now let us investigate the differences inside the group of the proposed combination criteria. First of all, classifiers and offer the best classification quality under macro-averaged measure. It means that these classifiers offer the best trade-off between macro-averaged precision and recall. Under macro-averaged FDR () measure, these algorithms outperform only and classifiers. For macro-averaged FNR () the investigated classifiers outperform all but classifiers. On the other hand, under the macro-averaged measures, there are no significant differences between and .
5 Conclusions
In this paper, a geometric combination scheme was proposed. Four different methods of producing the final output of EoC were investigated. The goal of this paper was to determine the best combination strategy for the given potential-function-induced geometrical space. The experimental comparison shows that the simple-average and trimmed-average rules are the best choice, because under macro-averaged measures they outperform the other proposed strategies and the reference methods. What is more, under the micro-averaged criteria they are comparable to the majority voting procedure. According to the outcome of the statistical evaluation, these two algorithms perform equally well. However, under macro-averaged measures, the trimmed-average rule achieves a slightly lower average rank. This suggests that it may be slightly better: the trimmed mean removes extreme values of the potential function, so it may be less influenced by outliers.
The obtained results are promising, so we intend to continue our research in the field of combining classifiers in the geometrical space. An interesting direction to explore is the application of a potential function whose shape is not given arbitrarily but is created by considering the data distribution.
Acknowledgments. This work was supported in part by the National Science Centre, Poland under the grant no. 2017/25/B/ST6/01750.
References
- [1] Bergmann, B., Hommel, G.: Improvements of general multiple test procedures for redundant systems of hypotheses. In: Multiple Hypothesenprüfung / Multiple Hypotheses Testing, pp. 100–115. Springer Berlin Heidelberg (1988). https://doi.org/10.1007/978-3-642-52307-6_8
- [2] Britto, A.S., Sabourin, R., Oliveira, L.E.: Dynamic selection of classifiers - a comprehensive review. Pattern Recognition 47(11), 3665–3680 (2014)
- [3] Burduk, R., Walkowiak, K.: Static classifier selection with interval weights of base classifiers. In: Asian Conference on Intelligent Information and Database Systems. pp. 494–502. Springer (2015)
- [4] Cortes, C., Vapnik, V.: Support-vector networks. Mach Learn 20(3), 273–297 (Sep 1995). https://doi.org/10.1007/bf00994018
- [5] Cyganek, B.: One-class support vector ensembles for image segmentation and classification. Journal of Mathematical Imaging and Vision 42(2-3), 103–117 (2012)
- [6] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7, 1–30 (2006)
- [7] Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer New York (1996). https://doi.org/10.1007/978-1-4612-0711-5
- [8] Drucker, H., Cortes, C., Jackel, L.D., Le Cun, Y., Vapnik, V.: Boosting and other ensemble methods. Neural Computation 6(6), 1289–1301 (1994)
