Confusion matrices and rough set data analysis
Ivo Düntsch, Günther Gediga

TL;DR
This paper explores the use of confusion matrices within the rough set data model to evaluate classifiers without relying on distributional assumptions, introducing new indices and classifiers based on rough confusion matrices.
Contribution
It introduces a novel approach combining confusion matrices with rough set theory to assess classifier quality without distributional assumptions.
Findings
Defined indices based on rough confusion matrices
Developed classifiers using rough set data analysis
Provided a framework for classifier evaluation without distribution assumptions
Abstract
A widespread approach in machine learning to evaluate the quality of a classifier is to cross-classify predicted and actual decision classes in a confusion matrix, also called an error matrix. A classification tool which does not assume distributional parameters but uses only the information contained in the data is based on the rough set data model, which assumes that knowledge is given only up to a certain granularity. Using this assumption and the technique of confusion matrices, we define various indices and classifiers based on rough confusion matrices.
| | True value: Positive | True value: Negative |
|---|---|---|
| Predicted value: Positive | True Positive | False Positive |
| Predicted value: Negative | False Negative | True Negative |

| Type | Price | Guarantee | Sound | Screen | d |
|---|---|---|---|---|---|
| 1 | high | 24 months | Stereo | 76 | high |
| 2 | low | 6 months | Mono | 66 | low |
| 3 | low | 12 months | Stereo | 36 | low |
| 4 | medium | 12 months | Stereo | 51 | high |
| 5 | medium | 18 months | Stereo | 51 | high |
| 6 | high | 12 months | Stereo | 51 | low |
Ivo Düntsch (notes 1, 2; affiliation 3), Günther Gediga
1 The ordering of authors is alphabetical and equal authorship is implied.
2 Permanent address: Dept. of Computer Science, Brock University, St Catharines, Canada
3 College of Mathematics and Informatics, Fujian Normal University, Fuzhou, China
4 Institut für Evaluation und Marktanalysen, Brinkstr. 19, 49143 Jeggen, Germany
1 Introduction
In pattern recognition and other disciplines of machine learning, the sum of the diagonal elements of a confusion matrix is widely used to measure the success of a classification based on an algorithm or human observation in comparison with a gold standard (or “true” measurement) such as classification by an expert. The main idea is that an algorithm (or an observer) forms its own hidden equivalence classes of the data, and is forced to assign the classes to the categories given by the gold standard. The underlying model may be one of a plethora of existing techniques; see e.g. [1, 2, 3]. One may ask whether such an index is valid for determining the quality of a classifier: since we approximate sets, namely decision classes, one should use a theory of set approximation such as the rough set approach to investigate this question.
In a first step we establish a connection between a rough set decision system and the resulting confusion matrix. We derive several approximations of the upper and lower bounds of the classes given by the gold standard; additionally, we consider the standard indices of rough set analysis for the coverage. Owing to lack of space we only indicate the procedures; detailed results and proofs will appear elsewhere.
2 Definitions and notation
Throughout, denotes a finite nonempty set with elements. Given a set of decision classes, a classifier is a mapping which predicts the class membership of an element of in a decision class. The predicted and true values of class membership can be cross–classified and counted in a confusion matrix. If success of a classifier is measured by error rate, confusion matrices may be used to analyse and to compare classifiers. A widely used confusion matrix of dimension two is shown in Table 2, and a general confusion matrix is shown in Table 2. An entry in the matrix is the number of elements of which are predicted to be in ; in particular, is the number of correctly classified elements.
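As a minimal sketch of the cross-classification described here, the following Python snippet counts predicted against true class labels and computes the success ratio as the diagonal sum divided by the total. The function names are illustrative, rows are taken to be predicted classes and columns true classes (as in the two-class table above), the true labels are the decision column d of the decision system table, and the predicted labels are hypothetical.

```python
from collections import Counter

def confusion_matrix(predicted, actual, classes):
    """Entry [i][j] counts objects predicted as classes[i] whose true class is classes[j]."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in classes] for p in classes]

def success_ratio(matrix):
    """Sum of the diagonal entries divided by the total number of objects."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

# True labels: decision column d of the six objects.
actual = ["high", "low", "low", "high", "high", "low"]
# Hypothetical predictions (an assumption, not taken from the paper).
predicted = ["high", "low", "low", "high", "high", "high"]

m = confusion_matrix(predicted, actual, ["high", "low"])
print(m)                  # [[3, 1], [0, 2]]
print(success_ratio(m))   # 5 of 6 objects lie on the diagonal
```

Note that some libraries use the transposed orientation (rows for true classes), so the orientation should be checked before comparing matrices.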
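The philosophy of rough sets is based on the assumption that knowledge of the world depends on the granularity of representation [4]. Mathematically, granularity may be expressed by an equivalence relation $\theta$ on a nonempty finite set $U$, up to the classes of which membership in a subset of $U$ can be determined. For rough approximation, two operators are defined on $2^U$ in the following way: Let $\mathcal{P}(\theta)$ be the set of equivalence classes of $\theta$. If $X \subseteq U$, then,

$$\underline{X}_\theta = \bigcup \{K \in \mathcal{P}(\theta) : K \subseteq X\} \quad \text{(lower approximation)}, \tag{2.1}$$

$$\overline{X}_\theta = \bigcup \{K \in \mathcal{P}(\theta) : K \cap X \neq \emptyset\} \quad \text{(upper approximation)}. \tag{2.2}$$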
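The two approximation operators can be sketched in a few lines of Python, assuming the standard Pawlak definitions: the lower approximation is the union of the equivalence classes contained in the target set, and the upper approximation is the union of the classes that meet it. The function names and the example partition are illustrative assumptions, not the paper's notation.

```python
def lower_approximation(partition, target):
    """Union of the equivalence classes that are entirely contained in target."""
    return set().union(*(k for k in partition if k <= target))

def upper_approximation(partition, target):
    """Union of the equivalence classes that share at least one element with target."""
    return set().union(*(k for k in partition if k & target))

# Hypothetical partition of a six-element set and a target subset.
partition = [{1}, {2, 3}, {4, 5}, {6}]
target = {1, 4, 6}

print(lower_approximation(partition, target))   # {1, 6}
print(upper_approximation(partition, target))   # {1, 4, 5, 6}
```

A set is definable exactly when its lower and upper approximations coincide, which is not the case for this target because the class {4, 5} is only partly covered.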
The main data type of the rough set approach are decision systems which are closely related to relational data tables with an added decision attribute. An example is shown in Table 4; there, the object set contains six elements, there are four independent attributes, and one decision attribute .
For simplicity of notation, we suppose that an attribute is a mapping from to the set of values of . Each set of independent attributes gives rise to an equivalence relation on by setting if and only if for all . Similarly, the decision attribute induces an equivalence relation , the classes of which are called decision classes. We cross–classify the classes of with the decision classes in a granule frequency matrix, see Table 4; there, , , and . Furthermore, we introduce the following parameters for each decision class :
[TABLE]
Consider the vector belonging to granule . If contains only one non–zero entry, we call the granule deterministic. In this case, and prediction based on is perfect. Otherwise, the granule is called indeterministic. A subset of is called definable, if it is a union of elements of .
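The construction of granules and of the granule frequency matrix can be sketched as follows, using the decision system from the table above. As an assumption for illustration, the granules are induced by the single attribute Price (chosen here because it yields an indeterministic granule); the function names are hypothetical.

```python
from collections import defaultdict

# Decision system: object -> (Price, Guarantee, Sound, Screen, d).
ATTRIBUTES = ("Price", "Guarantee", "Sound", "Screen")
TABLE = {
    1: ("high", "24 months", "Stereo", 76, "high"),
    2: ("low", "6 months", "Mono", 66, "low"),
    3: ("low", "12 months", "Stereo", 36, "low"),
    4: ("medium", "12 months", "Stereo", 51, "high"),
    5: ("medium", "18 months", "Stereo", 51, "high"),
    6: ("high", "12 months", "Stereo", 51, "low"),
}

def granules(table, attrs):
    """Partition the objects: two objects share a granule iff they agree on all attrs."""
    columns = [ATTRIBUTES.index(a) for a in attrs]
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[c] for c in columns)].add(obj)
    return list(groups.values())

def granule_frequency_matrix(table, attrs, classes):
    """Entry [i][j] counts the elements of granule i that lie in decision class j."""
    return [[sum(1 for obj in g if table[obj][-1] == c) for c in classes]
            for g in granules(table, attrs)]

def is_deterministic(row):
    """A granule is deterministic if its row has exactly one non-zero entry."""
    return sum(1 for count in row if count > 0) == 1

freq = granule_frequency_matrix(TABLE, ["Price"], ["high", "low"])
print(granules(TABLE, ["Price"]))            # [{1, 6}, {2, 3}, {4, 5}]
print(freq)                                  # [[1, 1], [0, 2], [2, 0]]
print([is_deterministic(r) for r in freq])   # [False, True, True]
```

Here the granule of high-priced objects is indeterministic, since objects 1 and 6 share the price but differ in the decision attribute.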
A major aim of rough set data analysis is to decide (or estimate) membership of an element of in a decision class using the knowledge given by a set of attributes, in particular, how well the decision classes can be approximated by the knowledge obtained from a partition induced by . Note that we can define a partial classifier as follows: If , then each is correctly classified (and these are the only ones). Thus we can set for all . If and , then the rough method assigns to one or more upper approximations of decision classes. In this sense, rough approximation is not a point estimate. With some abuse of language, we call a rough classifier.
In the sequel, we suppose that is the set of classes of a fixed equivalence relation on , called granules, and is a set of decision classes; to avoid trivialities we assume that . Lower and upper approximations are taken with respect to , and we shall omit the indices in the approximation functions. We shall write if , and the sets are pairwise disjoint. At times, we are only interested in whether the entry in a cell is 0 or not. To this end, we introduce an indicator function defined by
[TABLE]
For the basic philosophy and tools of the rough set method the reader is invited to consult [5]. For recent developments and more advanced methods the overview [6] is an excellent source.
3 Rough confusion matrices
According to the rough set philosophy, we can only distinguish elements of up to equivalence with respect to , hence, we must have for any classifier whenever and are in the same granule. Thus, with some abuse of language, we call a function a (rough) classifier. The meaning of the classifier is that each element of is predicted to be in . Thus, we obtain the predictor sets
[TABLE]
If , then no element of is predicted to be in by any class using . The (rough) confusion matrix of the classifier has dimension , row labels , column labels and, for , the entries
[TABLE]
Thus, . Since is a partition of , for all .
The rough confusion matrix can be obtained in several steps:
1. Write the granule frequency matrix obtained from and as in Table 4.
2. Relabel the rows of by by replacing with .
3. Aggregate the frequencies of the rows with the same label according to (3.2). If , fill the row labeled with 0s.
4. Sort the rows according to indices of their labels. The result has the form shown in Table 2.
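The steps above can be sketched in Python as follows. This is an illustration under stated assumptions: granules and decision classes are identified by 0-based indices, the classifier is given as a list mapping each granule index to a predicted class index, and the example frequencies and classifier are hypothetical rather than those of the paper's example.

```python
def rough_confusion_matrix(freq, classifier, n_classes):
    """Aggregate granule frequency rows by the class predicted for each granule.

    freq[i][j]    -- number of elements of granule i in decision class j
    classifier[i] -- index of the decision class predicted for granule i
    Rows of the result are predicted classes, columns are decision classes;
    a class that is never predicted keeps a row of zeros.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for row, label in zip(freq, classifier):       # relabel each granule row
        for j, count in enumerate(row):            # aggregate rows with the same label
            matrix[label][j] += count
    return matrix                                  # rows are already sorted by label index

# Hypothetical granule frequency matrix (three granules, two decision classes).
freq = [[1, 1], [0, 2], [2, 0]]
classifier = [0, 1, 0]    # granules 0 and 2 are predicted as class 0, granule 1 as class 1

m = rough_confusion_matrix(freq, classifier, 2)
print(m)                                        # [[3, 1], [0, 2]]
print(sum(m[i][i] for i in range(2)))           # 5 elements on the diagonal
print(rough_confusion_matrix(freq, [0, 0, 0], 2))   # [[3, 3], [0, 0]]
```

The column sums of the result equal the sizes of the decision classes regardless of the classifier, because every element of every granule is counted exactly once.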
Example 1.
We shall use the decision system of Table 4. Let be the equivalence relation generated by the attributes Price and Screen. The partition generated by has the classes
[TABLE]
and the decision classes
[TABLE]
We define by , and . The construction process is shown in Tables 7, 7, and 7.
Note that classifies five of the six elements of correctly, so that its success ratio is 5/6, whereas .
According to the rough set philosophy, the set approximates the diagonal set . The optimal approximation would be with ; in this case, is deterministic with respect to . Without knowledge of the source information system, but given the resulting confusion matrix, we obtain only . Similarly, it is easy to see that .
Two statistics are of importance in the rough set literature: The rough approximation quality is the weighted sum
[TABLE]
and the accuracy of approximation of the decision class is defined by the index
[TABLE]
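For orientation, the two statistics can be computed directly from a partition, assuming the standard rough set definitions: the approximation quality is the fraction of objects lying in the lower approximation of some decision class, and the accuracy of a decision class is the ratio of the sizes of its lower and upper approximations. This sketch works on the underlying partition, not on a confusion matrix, and all names and the example data are illustrative assumptions.

```python
from fractions import Fraction

def lower(partition, target):
    """Union of the granules contained in target."""
    return set().union(*(k for k in partition if k <= target))

def upper(partition, target):
    """Union of the granules that meet target."""
    return set().union(*(k for k in partition if k & target))

def approximation_quality(partition, decision_classes):
    """Fraction of all objects that lie in the lower approximation of a decision class."""
    n = sum(len(k) for k in partition)
    return Fraction(sum(len(lower(partition, y)) for y in decision_classes), n)

def accuracy(partition, y):
    """Size of the lower approximation of y relative to its upper approximation (y nonempty)."""
    return Fraction(len(lower(partition, y)), len(upper(partition, y)))

# Hypothetical granules and decision classes on six objects.
partition = [{1, 6}, {2, 3}, {4, 5}]
high, low = {1, 4, 5}, {2, 3, 6}

print(approximation_quality(partition, [high, low]))   # 2/3
print(accuracy(partition, high))                       # 1/2
print(accuracy(partition, low))                        # 1/2
```

The quality falls below 1 here only because of the granule {1, 6}, whose elements belong to different decision classes.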
Here, and are precision indices [7]. The measure is the maximal (best possible) value for the approximation quality of the set of an information system which produces the observed confusion matrix.
Note that and the upper bound weighted mean value
[TABLE]
of the are linked by a strictly monotone transformation, since
[TABLE]
Therefore, they are interchangeable as a measure of overall approximation quality.
The – accuracy is connected to the confusion matrix (and not to the underlying information system) by . As is a weighted mean of the and is a strictly monotone function of , we observe that upper confusion and upper confusion are maximal as well.
4 Refining the rough classifier
Thus far, we have put no restrictions on the classifier function . In order to bring the concept closer to rough sets, and use more of the available information, we shall suppose in the sequel that a rough classifier satisfies the condition
[TABLE]
(4.1) implies that at least one element of is classified correctly by . Furthermore,
Lemma 4.1.
1. If , then .
2. .
3. If , then for all .
Our first task is to approximate . To this end, we first consider . The cell counts, in particular, the cardinality of the deterministic granules contained in , and thus, . We can further remove certain entries, and define . Using Lemma 4.1 it is not hard, if somewhat tedious, to show the relationships among these indices:
Theorem 4.1.
Let . Then,
[TABLE]
Not all of these inequalities need to hold if does not satisfy (4.1).
Turning to upper approximations, we first observe that (4.1) is equivalent to by (2.2), and thus, is a lower bound of the rough upper approximation of , i.e. . This can be sharpened as follows: Set
[TABLE]
A moment’s reflection shows that adds all the cells in the partial granule frequency matrix spanned by the rows where , and adds the entries , where and .
If , then by Lemma 4.1, and therefore, there is some , such that and , i.e. . Therefore, if , there is at least one additional element which is in . Hence, we obtain a sharper bound by setting . Altogether, this leads to the following result:
Theorem 4.2.
Let . Then,
[TABLE]
Arguably, the simplest classifier that satisfies (4.1) is a maximal row classifier defined as follows: Consider a granule frequency matrix shown in Table 4. For each choose some such that is maximal in . Such always exists, but the choice need not be unique. Then, set . The classifier satisfies (4.1), and it is well compatible with the rough set philosophy in using only information supplied by the data.
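A maximal row classifier can be sketched as follows: for each row of the granule frequency matrix, pick a column holding a maximal entry. Since the choice need not be unique, this sketch breaks ties by taking the first maximal column, which is an arbitrary assumption; the function name and the example matrix are likewise hypothetical.

```python
def maximal_row_classifier(freq):
    """For each granule (row), return the index of a decision class with maximal count.

    Ties are broken in favour of the smallest column index (an arbitrary choice).
    """
    return [max(range(len(row)), key=row.__getitem__) for row in freq]

def correctly_classified(freq, classifier):
    """Number of elements whose decision class equals the class predicted for their granule."""
    return sum(row[label] for row, label in zip(freq, classifier))

# Hypothetical granule frequency matrix (three granules, two decision classes).
freq = [[1, 1], [0, 2], [2, 0]]

f = maximal_row_classifier(freq)
print(f)                               # [0, 1, 0]
print(correctly_classified(freq, f))   # 5
```

Because every granule is nonempty, the maximal entry of each row is positive, so at least one element of each granule is classified correctly; this is the property the text derives from condition (4.1).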
By definition, implies that is a maximum in row . We can use this observation to establish an even sharper upper bound of : Suppose that , and consider the partial granule matrix
[TABLE]
Since a maximum of each row is in column , it follows that for all , and therefore, . Setting we obtain
Theorem 4.3.
for all .
Finally, we estimate the rough upper bound of using . Setting , it can be shown that
Theorem 4.4.
for all .
5 Conclusion and outlook
In this note, we have explored a connection between rough set approximation and confusion matrices, and have presented several natural indices that approximate the lower and upper bounds given by the reference standard. Owing to lack of space, we have only indicated the procedures with respect to one observer.
The next step will be to broaden the investigation to two or more observers: Each of these has internal sets and of granules which need to be reconciled to a common standard. This is related to inter-rater reliability, which is commonly used in psychology (and AI) to gauge agreement among experts. We shall also re-interpret common statistics of rough set analysis based on rough confusion matrices. This will, in some sense, complement our earlier research on precision indices in the rough set framework [8].
References
- [1] Novaković J, Veljović A, Ilić S, Papić Ž and Tomović M 2017 Theory and Applications of Mathematics & Computer Science 7 39–46
- [2] Hand D J 2005 Applied Stochastic Models in Business and Industry 21 97–109 ISSN 1526-4025
- [3] Caelen O 2017 Annals of Mathematics and Artificial Intelligence 81 429–450
- [4] Pawlak Z 1982 Internat. J. Comput. Inform. Sci. 11 341–356
- [5] Düntsch I and Gediga G 2000 Rough set data analysis: A road to non-invasive knowledge discovery (Bangor: Methodos Publishers (UK))
- [6] Nguyen H and Skowron A 2013 Rough Sets and Intelligent Systems - Professor Zdzisław Pawlak in Memoriam, Vol 1 ed Skowron A and Suraj Z (Springer Verlag) pp 75–173
- [7] Gediga G and Düntsch I 2001 Artificial Intelligence 132 219–234
- [8] Gediga G and Düntsch I 2014 Transactions on Rough Sets Vol. XVII (Lecture Notes in Computer Science vol 8375) ed Peters J and Skowron A (Heidelberg: Springer Verlag) pp 33–47
