TL;DR
This paper evaluates the quality of heuristic explanations for ML models, especially boosted trees, revealing their inadequacies compared to rigorous methods across various datasets.
Contribution
It extends previous rigorous explanation methods to boosted trees and assesses heuristic explanation quality, highlighting their limitations.
Findings
Heuristic explanations are often inadequate for entire instance spaces.
Rigorous explanations provide more reliable insights.
Heuristic methods may mislead in model interpretation.
Abstract
Recent years have witnessed a fast-growing interest in computing explanations for the predictions of Machine Learning (ML) models. For non-interpretable ML models, the most commonly used approaches for computing explanations are heuristic in nature. In contrast, recent work proposed rigorous approaches for computing explanations, which hold for a given ML model and prediction over the entire instance space. This paper extends earlier work to the case of boosted trees and assesses the quality of explanations obtained with state-of-the-art heuristic approaches. On most of the datasets considered, and for the vast majority of instances, the explanations obtained with heuristic approaches are shown to be inadequate when the entire instance space is (implicitly) considered.
On Validating, Repairing and Refining Heuristic ML Explanations
Alexey Ignatiev (1, 3), Nina Narodytska (2), Joao Marques-Silva (1)
(1) Faculty of Science, University of Lisbon, Portugal, {aignatiev,jpms}@ciencias.ulisboa.pt
(2) VMware Research, CA, USA, [email protected]
(3) ISDCT SB RAS, Irkutsk, Russia
1 Introduction
Progress in Machine Learning (ML) has motivated efforts towards verifying the properties of ML models and developing a better understanding of their outcomes. As a result, two concrete lines of research can be broadly identified. One line is concerned with validating and ensuring specific properties of neural networks. Another line is concerned with developing human-interpretable explanations for predictions made by ML models. Perhaps unsurprisingly, both lines of research have witnessed a growing use of logic-based methods [26, 22, 44, 47, 25, 24]. The relevance of eXplainable Artificial Intelligence (XAI) is illustrated by a fast-growing number of works offering alternative ways of computing explanations for ML predictions. More importantly, recent legislation imposes a requirement on the explainability of ML systems [21, 14].
Some ML models are readily interpretable. This is the case with logic-based models, e.g. decision trees, lists or sets [29, 52, 3, 37, 25]. Other ML models are not readily interpretable. This is the case with Neural Networks (NNs), Support Vector Machines (SVMs), and boosted trees, among many others. For models that are not readily interpretable, there has been work on computing one or more explanations given an instance [10, 49, 48, 4, 27, 50, 43, 35, 30, 42, 53, 47, 1, 2]. One well-known approach for computing explanations is heuristic in nature. Such explanations can be described as local, i.e. the computation of an explanation explores locally the instance sub-space close to a given instance. Well-known examples are LIME [39] and, more recently, Anchor [40]. Since these approaches are local in nature, and so do not consider the entire instance space, a natural question is to understand how reliable the computed (local) explanations are. For example, computed local explanations may be too optimistic, in that there could exist instances (in instance space) for which the computed explanation fails to apply, i.e. a different prediction is obtained with the ML model. Alternatively, computed local explanations may be too pessimistic, in that it may be possible to prove that some literals in an explanation are irrelevant and can be dropped. Recent work [40] compares Anchor against LIME, and shows that the former is significantly more accurate than the latter. However, and to our best knowledge, there is no earlier work assessing the quality of the local explanations computed by Anchor or LIME against some (global) reference.
Logic-based approaches have been proposed recently [47, 24]. They provide strong guarantees given that computed explanations hold globally over feature space, in contrast with local explanations computed with heuristic approaches. Shih et al. [47] propose a compilation-based approach, representing all prime implicants of the function explaining some prediction. Ignatiev et al. [24] propose to compute prime implicants on demand, by formulating the problem of computing an explanation as abductive reasoning. Whereas the former approach enables aggregated analysis of explanations, the latter approach is expected to scale better, as explanations are computed on demand.
This paper builds on the approach of Ignatiev et al. [24], but investigates instead the computation of global explanations for the concrete case of boosted trees. More importantly, the paper develops solutions for assessing the quality of local explanations, using boosted trees as a test case. Overall, the paper has three main contributions. First, the paper extends earlier work on finding global explanations [24] to the case of boosted trees computed with XGBoost [9] (XGBoost has achieved significant success in ML challenges hosted by Kaggle), by devising a new constraint-based encoding for boosted trees. As shown in the experiments, computing restricted forms of abduction is far more efficient on the proposed encoding of boosted trees than on the original encoding of NNs [17]. Second, the paper develops algorithms for: (i) assessing the quality of local explanations; (ii) repairing those local explanations when they are optimistic; (iii) refining local explanations in case they are pessimistic. The algorithms have been integrated in the XPlainer XAI tool. Third, the paper conducts the first experimental assessment of the quality of explanations computed by Anchor and LIME in light of global explanations. The paper considers five datasets [16, 3, 40], which are classified with XGBoost [9]. For two datasets, Anchor is optimistic in more than 99% of the instances, meaning that the explanations computed by Anchor fail to apply for instances of input space in more than 99% of the cases. For two other datasets, Anchor is optimistic in more than 80% of the instances. Although the results indicate that LIME is often more pessimistic than Anchor, none of the tools dominates the other in terms of computing optimistic explanations. These results offer more fine-grained insights than earlier comparisons [40]. Depending on the dataset considered, global explanations can be larger than local explanations. This is a necessary result since global explanations are accurate (being either subset- or cardinality-minimal) and so can neither be optimistic nor pessimistic. Furthermore, and for the boosted trees computed with XGBoost [9], the run times of XPlainer are in general comparable to those of LIME and Anchor.
The paper is organized as follows. Section 2 introduces the notation and definitions. Section 3 develops an encoding for computing global explanations with boosted tree classifiers. Section 4 proposes the algorithms behind the XPlainer tool. These algorithms include finding subset- and cardinality-minimal (global) explanations, validating heuristic explanations, repairing heuristic explanations in case these are optimistic, and refining heuristic explanations in case these are pessimistic. Section 5 analyzes experimental results obtained on five well-known datasets [16, 3, 40]. Section 6 offers a brief overview of related work and the paper concludes in Section 7.
2 Background
A classification scenario is assumed, with categorical features and prediction classes . Each feature takes values from some domain . (Features need not be categorical, but this assumption simplifies the notation used.) The training data consists of a set of instances, where each instance is taken from the instance space, defined by , and where each instance is associated with some class , taken from , which is referred to as the target prediction given the instance.
Boosted Trees and Explanations.
Boosted trees are one of the most widely used ML models [9]. This paper considers XGBoost [9]. Throughout the paper, the well-known Zoo animal classification dataset (https://www.kaggle.com/uciml/zoo-animal-classification) is used as the running example. The result of running XGBoost on this dataset is shown in Figure 1.
A larger number of tree nodes (or even more trees) could be considered for each class. However, for the purposes of illustrating the main ideas in the paper, the simpler version shown suffices. section 3 provides a more detailed account of boosted trees, that will serve to motivate the development of constraint-based encodings.
For the running example, one instance in the dataset is:
[TABLE]
Given the instance above, the execution of Anchor [40] on the model shown produces the following (local) explanation:
[TABLE]
Unfortunately, even for this simple dataset, and considering only the instances in the original dataset, there is at least one other instance to which the Anchor explanation also applies, but for which the boosted tree predicts a different class:
[TABLE]
By analyzing the weight resulting from each tree, we can conclude that the boosted tree prediction for this instance is indeed amphibian. The remainder of this paper investigates approaches for assessing the quality of the explanations computed by heuristic approaches, like LIME and Anchor, but also for computing (global) explanations.
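The dataset-level check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the classifier `toy_predict`, the three instances, and the explanation are hypothetical stand-ins, not the XGBoost model of Figure 1 or the actual Anchor output.

```python
def covers(explanation, instance):
    """True if the instance agrees with every (feature, value) literal."""
    return all(instance[f] == v for f, v in explanation.items())


def dataset_counterexamples(explanation, target, dataset, predict):
    """Instances covered by the explanation whose prediction is not `target`."""
    return [x for x in dataset if covers(explanation, x) and predict(x) != target]


# Hypothetical toy classifier and data (NOT the model of Figure 1).
def toy_predict(x):
    if x["milk"]:
        return "mammal"
    return "amphibian" if x["aquatic"] else "reptile"


toy_data = [
    {"tail": 1, "milk": 0, "aquatic": 0},
    {"tail": 1, "milk": 0, "aquatic": 1},
    {"tail": 0, "milk": 1, "aquatic": 0},
]
local_explanation = {"tail": 1, "milk": 0}  # assumed heuristic output
clashes = dataset_counterexamples(local_explanation, "reptile", toy_data, toy_predict)
```

Scanning the dataset can only refute an explanation; finding no clash among the listed instances says nothing about the rest of the instance space, which is what the logic-based check is for.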
Logic-Related Concepts.
Definitions standard in first-order logic (FOL) are assumed (e.g. [20]). Given a signature of predicate and function symbols, each of which is characterized by its arity, a theory is a set of first-order sentences over . is extended with the predicate symbol , denoting logical equivalence. (Sorts could be used to add rigor to the presentation. However, to keep notation as simple as possible, sorts are omitted.) A model is a pair , where denotes a universe, and is an interpretation that assigns a semantics to the predicate and function symbols of . A set of variables is assumed, distinct from the symbols in . A (partial) assignment is a (partial) function from to . Assignments are represented as conjunctions of literals (or cubes), where each literal is of the form s.t. , . We use cubes and assignments interchangeably. Whenever convenient, cubes are treated as sets of literals. The set of free variables in a formula is denoted by . Assuming the standard semantics of FOL, and given an assignment and corresponding cube , the notation is used to denote that is true under model and cube (or assignment ). In this case, (resp. ) is called a satisfying assignment (resp. cube), and the assignment is partial if (and so if is partial). A solver for a FOL theory is referred to as a -oracle.
The generalization of prime implicants to FOL [33] will be used throughout. Given a FOL formula with a model , a cube is a prime implicant of if: (i) ; and (ii) if is a cube with and , then . A smallest prime implicant is a prime implicant of minimum size. Smallest prime implicants can be related with minimum satisfying assignments [12]. A prime implicant of and given a cube is a prime implicant of such that .
Satisfiability Modulo Theories (SMT) represent restricted (and often decidable) fragments of FOL [5, 6]. All the above definitions apply to SMT. The ML models proposed in this paper exploit the decidable Linear Real Arithmetic (LRA) fragment of FOL [5]. The function symbols are and the predicate symbol is , with the universe being .
Abduction and Prime Implicants.
Given some manifestations (e.g. a prediction), a set of hypotheses (e.g. the given instance), and a background theory (e.g. the encoding of some ML model), abduction is the problem of computing subset-minimal or cardinality-minimal subsets of the hypotheses which are consistent with the background theory and entail the manifestation [13, 45, 23]. The relationship of abduction with prime implicants in the context of computing explanations of ML models was established in earlier work [24]. As a result, this paper considers solely prime implicants as the desired explanations of predictions of ML models. As in earlier work [24], we associate a logic theory with a given ML model , and encode as a formula of . Thus, in contrast with other approaches [39, 40], we must be able to have access to a constraint-based representation (i.e. a formula) of the ML model .
We consider an instance with which a prediction is associated. With a slight abuse of notation, is also used to denote the cube associated with the instance, and is used to denote the literals associated with the prediction. The relationship between abductive explanations and prime implicants is well-known (e.g. [33, 34]). Regarding the computation of abductive explanations, and the same holds for any subset of . This means that it suffices to consider the constraint , which is equivalent to . Thus, a subset-minimal explanation (given ) is a prime implicant of (given ), and a cardinality-minimal explanation (given ) is a cardinality-minimal prime implicant of (given ). Hence, we can compute subset-minimal (resp. cardinality-minimal) explanations by computing instead prime implicants (resp. shortest PIs) of . As a final remark, the cardinality-minimal prime implicants of are selected among those that are contained in . For instance, assuming a FOL encoding of a boosted tree (this encoding is detailed in Section 3), and given the Zoo running example, and the instance yielding the reptile prediction, then we can compute the explanation (as described in Section 4):
[TABLE]
We emphasize that this explanation is a prime implicant of . Thus, and by definition, the explanation guarantees that the prediction remains unchanged for any other instance in instance space for which the six literals remain unchanged. A downside is that this explanation can include more literals than the ones computed by Anchor [40].
Computing Abductive Explanations.
Given the formalization above, abductive explanations can be obtained by computing prime implicants. Earlier work [24] outlined two algorithms for computing explanations, based on the extraction of prime implicants and (smallest) prime implicants. The former corresponds to subset-minimal explanations and is shown in Algorithm 1. In contrast, the latter (see [24, Algorithm 2]) corresponds to cardinality-minimal explanations. From a computational complexity viewpoint, and assuming as oracle for NP either an ILP or LRA solver, computing subset-minimal explanations is hard for NP, and can be solved with a linear number of calls to an oracle for NP [51] (as shown in Algorithm 1). In contrast, computing a cardinality-minimal explanation is (believed to be) harder, being hard for $\Sigma_2^{\mathrm{P}}$, and can be solved with a linear number of calls to an oracle for $\Sigma_2^{\mathrm{P}}$ [51] (or alternatively, using implicit hitting sets as shown in [24, Algorithm 2]).
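The linear, deletion-based extraction of a subset-minimal explanation can be sketched as below. This is a minimal sketch, not the paper's Algorithm 1 verbatim: the NP oracle is replaced by a brute-force enumeration over binary features (`holds_everywhere`), and `toy_predict` is a hypothetical classifier introduced only for illustration.

```python
from itertools import product


def holds_everywhere(cube, predict, target, n):
    """Brute-force stand-in for the oracle: every instance that agrees
    with `cube` (a dict feature index -> value) is classified as `target`."""
    for x in product([0, 1], repeat=n):
        if all(x[i] == v for i, v in cube.items()) and predict(x) != target:
            return False
    return True


def subset_minimal_explanation(instance, predict):
    """Drop literals one at a time; keep a literal only if dropping it
    would allow some instance with a different prediction."""
    n, target = len(instance), predict(instance)
    cube = dict(enumerate(instance))
    for i in range(n):  # one oracle call per feature
        trial = {j: v for j, v in cube.items() if j != i}
        if holds_everywhere(trial, predict, target, n):
            cube = trial
    return cube


def toy_predict(x):  # hypothetical classifier: x0 AND (x1 OR x2)
    return int(x[0] and (x[1] or x[2]))


explanation = subset_minimal_explanation((1, 1, 1), toy_predict)
```

The result depends on the order in which literals are tested; any order yields a subset-minimal cube, but not necessarily a smallest one.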
3 Encoding Boosted Trees
This section proposes an SMT encoding of an ensemble of decision trees produced by the XGBoost algorithm. Suppose our training data is specified over features, , and there are possible classification outcomes. For example, there are 17 features per sample in the Zoo dataset. Features describe characteristics of an animal, e.g. whether an animal lays eggs, the number of legs, etc. There are seven possible outcomes: amphibian, bird, bug, invertebrate, fish, mammal, and reptile (see Figure 1). For simplicity, we assume that all features are binary. We discuss how to extend our encoding to categorical and continuous features at the end of the section.
An XGBoost model is an ensemble of decision trees. A decision tree is a binary tree. A node of a tree is denoted by . We distinguish between non-leaf or internal nodes and leaf nodes. A non-leaf node of a decision tree contains a logical predicate over a feature variable of the form or for short (XGBoost uses constraints of the form which is equivalent to for binary features). Outgoing edges of a node are labeled true (the right branch) and false (the left branch). For convenience, we assume that an internal node has two attributes and : stores the predicate of this node and stores the index of the feature variable in . A leaf node contains a numerical value , . We assume that a leaf node has one attribute that stores this value . Consider the first tree in the Zoo example. The tree has three nodes. The root node has the following attributes: and . The first leaf node has one attribute and the second leaf node has one attribute .
The number of trees in the ensemble is equal to the number of classes, , times the number of trees per class, , where is specified by the user. In other words, we have trees, , where the th class is represented by trees, . The depth of a tree can also be specified by the user. In our Zoo example, is one and the depth of the tree is one.
Next, we consider how classification is performed using the XGBoost model. Given a concrete example , we want to know which class it belongs to. To answer this question, we compute the score of the th class for , . W.l.o.g. we discuss how to find the score of the first class. We consider trees that represent the first class in the ensemble , . Each of these trees contributes a score to the total score of the first class. We consider each tree individually. A prediction path of in is a path from the root to a leaf such that for an internal node on the path holds if follows the right branch and does not hold if follows the left branch. The leaf contains the score value. Then we aggregate the results from these trees to get the final score of the first class. We perform the same score computation for all classes. The class with the largest score wins. In general, we can normalize these scores to obtain probabilities but scores are sufficient for our purpose. Consider our running example with the first instance from Section 2 classified as reptile. We have seven trees here, as . So, we get as tail holds, as feathers does not hold. Similarly, we get , , , and . As can be seen, has the largest score so the classification class is “reptile”.
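The score computation described above can be sketched as follows. The tree representation (nested dicts with a `feature` key and `true`/`false` subtrees, leaves as floats) and all leaf values are assumptions made for this illustration; they are not the values of the Zoo model in Figure 1.

```python
def tree_score(tree, x):
    """Follow the prediction path of x from the root to a leaf and return the leaf value."""
    while isinstance(tree, dict):
        tree = tree["true"] if x[tree["feature"]] else tree["false"]
    return tree


def class_scores(ensemble, x):
    """Sum the contributions of all trees of each class."""
    return {c: sum(tree_score(t, x) for t in trees) for c, trees in ensemble.items()}


def classify(ensemble, x):
    """The class with the largest total score wins."""
    scores = class_scores(ensemble, x)
    return max(scores, key=scores.get)


# Hypothetical ensemble: one depth-1 tree per class, made-up leaf values.
toy_ensemble = {
    "bird": [{"feature": "feathers", "true": 0.4, "false": -0.2}],
    "mammal": [{"feature": "milk", "true": 0.5, "false": -0.3}],
    "reptile": [{"feature": "tail", "true": 0.1, "false": -0.1}],
}
toy_instance = {"feathers": 0, "milk": 0, "tail": 1}
toy_scores = class_scores(toy_ensemble, toy_instance)
toy_class = classify(toy_ensemble, toy_instance)
```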
Next, we consider how to encode an XGBoost model into SMT. At a high level, our encoding simulates the scores computation for a possible input. We introduce three sets of variables. For a binary feature we introduce a Boolean variable , . These Boolean variables represent the space of all possible inputs. For the th tree we introduce a real valued variable , . The variable encodes the score contribution from the th tree. Finally, for the th class we introduce one variable , that stores the score of the th class. We connect Boolean variables with predicates: . Then, we encode the score computation for a tree by encoding all paths in as follows. Let be a set of all distinct paths from the root to a leaf in . Consider a path , from the root to a leaf . We recall that if follows the right (left, resp.) branch in a node then the predicate has to hold (to be violated, resp.). Let be a set of nodes where takes the right branch and be a set of nodes where takes the left branch. We enforce the following constraints:
[TABLE]
To compute the score of the th class we add constraints, : .
Consider how the encoding works on the running example. Consider the first tree. For simplicity, we denote the index attribute of the root node as ‘tail’. We have two paths in the tree. For the first path, we add a constraint and for the second path we add a constraint . With one tree per class, we get . Other trees could be encoded similarly.
Next we discuss how to extend our encoding to categorical and continuous data. In the case of categorical data, a common approach is to apply one-hot encoding to convert discrete values to binary values. This transformation is performed on the original data. Hence, our encoding can be applied directly with a small augmentation. We enforce that exactly one of the binary features that encode a categorical feature can be true. The case of continuous features is handled similarly. The main difference is that logical predicates in a node are of the form , where is a constant value. Here, for each predicate that occurs in , we introduce a Boolean variable such that . Then the encoding above can be reused.
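The path-based encoding of a single tree can be sketched as below: every root-to-leaf path becomes one implication from the path conditions to an equality fixing the tree's score variable. The names `o_<feature>` and `r_1` and the textual constraint format are assumptions of this sketch; an actual implementation would hand the constraints to an SMT solver rather than render them as strings.

```python
def paths(tree, pos=(), neg=()):
    """Yield (features taken true, features taken false, leaf value) for each
    root-to-leaf path of a tree given as nested dicts with float leaves."""
    if not isinstance(tree, dict):
        yield (pos, neg, tree)
        return
    f = tree["feature"]
    yield from paths(tree["true"], pos + (f,), neg)
    yield from paths(tree["false"], pos, neg + (f,))


def path_constraints(tree, score_var):
    """Render each path as: (conjunction of path literals) -> score_var = leaf."""
    constraints = []
    for pos, neg, leaf in paths(tree):
        literals = [f"o_{f}" for f in pos] + [f"not o_{f}" for f in neg]
        constraints.append(f"({' and '.join(literals)}) -> {score_var} = {leaf}")
    return constraints


# Hypothetical trees with made-up leaf values.
shallow = {"feature": "tail", "true": 0.1, "false": -0.1}
deeper = {"feature": "a", "true": {"feature": "b", "true": 1.0, "false": 2.0}, "false": 3.0}
shallow_constraints = path_constraints(shallow, "r_1")
```

Since exactly one path condition holds for any complete input, the implications together pin the score variable to a single leaf value.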
4 Reasoning about Explanations
This section discusses the practicality of the abduction-based approach [24] and focuses on applying it to the explanation of a tree ensemble model using the novel constraint-based encoding proposed above. (The ideas of this section still apply if another encoding of a tree ensemble is considered.) Hereinafter, this approach is referred to as XPlainer. Concretely, the section outlines possible use cases of applying XPlainer in practice: either alone or together with a heuristic explanation approach, e.g. LIME or Anchor.
4.1 Minimal Global Explanations
First of all, the XPlainer approach can be applied directly to computing subset- and cardinality-minimal explanations [24] for boosted trees using the encoding proposed in section 3. Let us apply XPlainer to the running example model shown in Figure 1. Assume that the encoding of the model is represented as a formula , which is the following conjunction of constraints.
[TABLE]
Let be a literal over Boolean variable used above, e.g. is either or . The “translation” of each feature value into the corresponding literal is straightforward. (In practice, categorical feature legs should be one-hot encoded. But for the sake of simplicity and without loss of correctness, we use a Boolean variable , s.t. is true iff .) Now, for each input defined as a conjunction of literals over , the prediction is determined by the largest score value , , computed using formula . Given the list of class scores , the prediction of class can be guaranteed using a conjunction of linear inequalities enforcing value to be the largest, i.e. with the use of formula . Consider the following input instance and its respective prediction:
[TABLE]
Since mammal represents class 6, this prediction can be encoded as the following conjunction of inequalities (observe that they hold for the considered input):
[TABLE]
Let us illustrate the flow of algorithm 1. The algorithm makes calls to a reasoner deciding whether or not a candidate subset of the input instance is a prime implicant of . This is true iff formula is unsatisfiable. Hence, if we fix the values of features in then no misclassification, i.e. , is possible for model .
Clearly, when includes all literals , formula is unsatisfiable. Recall that algorithm 1 iteratively removes literals from and checks whether or not is still unsatisfiable. If it is, is still an implicant of , i.e. literal is not responsible for the prediction . Otherwise, is not an implicant, i.e. is necessary and, thus, must be included in the explanation. This process repeatedly checks all literals .
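The check performed at each step can be written compactly as follows. The symbols are assumptions chosen for this note: $\mathcal{M}$ stands for the encoding of the model, $\rho$ for the candidate cube, $s_1, \ldots, s_m$ for the class scores, and $j$ for the predicted class. Whether the score comparison is strict is also an assumption here.

```latex
% Assumed notation: \mathcal{M} = model encoding, \rho = candidate cube,
% s_i = score of class i, j = predicted class.
\[
  \pi_j \;\triangleq\; \bigwedge_{i \neq j} \left( s_j > s_i \right)
\]
\[
  \rho \text{ is an implicant of the prediction}
  \iff \rho \land \mathcal{M} \land \neg \pi_j \text{ is unsatisfiable}
  \iff \rho \land \mathcal{M} \models \pi_j
\]
```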
One straightforward optimization to make before executing Algorithm 1 is to discard from all features unused by the model as they cannot affect the prediction. In our example, we can safely remove all features except for the seven features used in the model. This results in . Hence, the first literal to be tested by Algorithm 1 is . The reasoning oracle is called to check unsatisfiability of . Note that this formula is indeed unsatisfiable because the largest class score is still , which is enforced by literal . Thus, is not crucial for the prediction and it gets removed from . The second literal to check is . This time, the oracle tests unsatisfiability of and reports that the formula is satisfiable. This means that a misclassification can occur if is discarded. Indeed, since variables and are free, the oracle can assign any values to them, e.g. setting and results in being the largest class score. As a result, literal is vital for the prediction to persist. The algorithm proceeds doing similar checks with respect to all the remaining literals in . As a result, it ends up having , i.e. the explanation contains only one feature.
Observe that explanations computed this way are subset-minimal. Furthermore, since a reasoner deals with the properties of the classifier’s symbolic representation in the complete instance space, these explanations are global, i.e. they hold for the entire space. Given a global explanation for the prediction of a data input, it is guaranteed that there is no point in the instance space such that (1) the point agrees with the explanation on all of its literals and (2) the prediction for the point differs from the one being explained. Global explanations are significantly more powerful than explanations offered by the state-of-the-art heuristic approaches, e.g. LIME [39] or Anchor [40], since the latter hold only for a local neighborhood of a given instance.
As detailed in [24], cardinality-minimal explanations can also be computed, e.g. using the implicit hitting set approach [8, 23]. Similarly to the case of subset-minimal explanations, one would need to make a number of similar unsatisfiability calls to a reasoner. However, and in contrast to Algorithm 1, computing a smallest-size explanation is hard for $\Sigma_2^{\mathrm{P}}$ and in the worst case requires an exponential number of iterations.
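One way to picture the implicit hitting set idea is the simplified loop below. It is a sketch under assumptions, not [24, Algorithm 2] itself: both the oracle and the minimum hitting set computation are brute force, features are binary, and `toy_predict` is a hypothetical classifier. Each counterexample contributes the set of features on which it departs from the instance; any valid explanation must fix at least one feature of every such set.

```python
from itertools import combinations, product


def find_counterexample(cube, predict, target, n):
    """Return an instance agreeing with `cube` but not classified as `target`, or None."""
    for x in product([0, 1], repeat=n):
        if all(x[i] == v for i, v in cube.items()) and predict(x) != target:
            return x
    return None


def min_hitting_set(sets, universe):
    """Smallest subset of `universe` intersecting every set (brute force by size)."""
    for k in range(len(universe) + 1):
        for cand in combinations(universe, k):
            if all(set(cand) & s for s in sets):
                return set(cand)
    return None


def smallest_explanation(instance, predict):
    n, target = len(instance), predict(instance)
    to_hit = []  # one set of "departing" features per counterexample seen so far
    while True:
        chosen = min_hitting_set(to_hit, range(n))
        cube = {i: instance[i] for i in chosen}
        cex = find_counterexample(cube, predict, target, n)
        if cex is None:
            return cube  # valid, and no smaller set hits all collected sets
        to_hit.append({i for i in range(n) if cex[i] != instance[i]})


def toy_predict(x):  # hypothetical classifier: (x0 AND x1 AND x2) OR x3
    return int((x[0] and x[1] and x[2]) or x[3])


smallest = smallest_explanation((1, 1, 1, 1), toy_predict)
```

On this toy model a deletion-based pass that tests the last feature first would return the three-literal explanation, whereas the loop above returns the single literal on the last feature, at the price of a potentially exponential number of iterations.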
4.2 Validating Heuristic Explanations
Besides computing global explanations directly, XPlainer can be applied to validating given heuristic explanations. Indeed, one can immediately notice that in order to check the validity of a heuristic explanation for a model formula and data instance classified as , it suffices to do one oracle call similar to the ideas outlined in subsection 4.1. The corresponding procedure is shown in algorithm 2.
As later shown in section 5, this simple and efficient procedure is able to prove or disprove an explanation to be globally correct. (An example of an explanation reported by Anchor and a counterexample computed by algorithm 2 is discussed in section 2.) Since heuristic approaches compute local explanations, it is not surprising that most of them are incorrect from the perspective of the complete instance space (see section 5). An upside of XPlainer is that it can efficiently provide a counterexample to an explanation demonstrating its unsoundness. Moreover, it can be used not only to (in)validate an explanation by providing one counterexample, but it can also enumerate (all or a limited number of) counterexamples showing why and when the explanation is incorrect. Based on such evidence, one can try to devise a way to correct the explanation or compute a better alternative from scratch.
In many settings, computing correct explanations is crucial from a practitioner’s point of view as they are supposed to provide a user with hints of why the model behaves one way or another. These hints should reflect the real properties of the model. If they do not, a comprehensive understanding of the model is infeasible.
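Validation and counterexample enumeration can be sketched as below. The single oracle call is replaced by brute-force enumeration over binary features, and `toy_predict` is a hypothetical classifier; both are assumptions of this sketch rather than the paper's Algorithm 2.

```python
from itertools import product


def counterexamples(explanation, target, predict, n, limit=None):
    """Enumerate instances that agree with `explanation` (dict index -> value)
    but are not classified as `target`; stop after `limit` hits if given."""
    found = []
    for x in product([0, 1], repeat=n):
        if all(x[i] == v for i, v in explanation.items()) and predict(x) != target:
            found.append(x)
            if limit is not None and len(found) >= limit:
                break
    return found


def is_valid(explanation, target, predict, n):
    """An explanation is globally valid iff it has no counterexample."""
    return not counterexamples(explanation, target, predict, n, limit=1)


def toy_predict(x):  # hypothetical classifier: x0 AND (x1 OR x2)
    return int(x[0] and (x[1] or x[2]))


optimistic_cex = counterexamples({0: 1}, 1, toy_predict, 3)
```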
4.3 Repairing Heuristic Explanations
If an explanation is proved to be too optimistic, it is often vital to find a way to make a number of (ideally, minimal) changes to the explanation so that it becomes correct in the instance space. An explanation for the prediction of instance is optimistic when the features of do not suffice to guarantee the prediction. A way to repair is to find another subset of features such that is a correct explanation. It is preferred to minimize the “distance” between and . This is another task where XPlainer can help since it deals with a logical representation of the classifier and is able to answer queries about the classifier system and its behavior.
Computing minimum-size changes to the explanation is related to identifying minimal inconsistencies and/or diagnoses for a failing system subject to user preferences, which has been studied in prior works [7, 41, 32]. It is known, however, that in a number of settings the latter problem is hard for the second level of the polynomial hierarchy [32]. Therefore, it seems unlikely that given an explanation, one can efficiently extract another one, which would be guaranteed to minimally differ from the original one. However, the problem can be solved heuristically. An approach to this problem using the abilities of XPlainer is shown in Algorithm 3. The algorithm follows the procedure of Algorithm 1 for extracting subset-minimal explanations. It additionally receives an (invalid) heuristic explanation that is to be repaired. The key idea of the approach is to delay as much as possible the testing of features of while computing a valid explanation. Hence, the algorithm tries to remove as many features from the outside of as possible. Afterwards, it traverses the features of . To emphasize again, Algorithm 3 does not guarantee the resulting explanation to minimally differ from the original explanation. However, an upside of the algorithm is that it does not deal with a $\Sigma_2^{\mathrm{P}}$-hard problem; instead, it makes a linear number of calls to an NP-oracle, which is practically much more efficient. Having such a repair should suffice in many practical situations.
As an example, recall the pitviper instance and the invalid explanation of Anchor shown in section 2. Anchor claims features , , toothed, and to be responsible for the reptile prediction. section 2 demonstrated that this explanation is invalid by providing a counterexample instance classified by the model as amphibian. Applying algorithm 3 leads to the following correct explanation:
[TABLE]
Note that although this explanation is larger than the one of Anchor, it is global for the entire instance space, i.e. there is guaranteed to be no counterexample for this explanation. Also observe that Algorithm 3 is able to keep features and in the explanation even though there may be a repair with fewer changes.
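The ordering idea behind the repair procedure can be sketched as follows: features outside the heuristic explanation are tested for removal first, so that the features of the heuristic explanation are dropped only as a last resort. The brute-force oracle and the classifier `toy_predict` are assumptions of this sketch, which is not the paper's Algorithm 3 verbatim.

```python
from itertools import product


def holds_everywhere(cube, predict, target, n):
    """Brute-force stand-in for the NP oracle: no instance matching `cube` escapes `target`."""
    for x in product([0, 1], repeat=n):
        if all(x[i] == v for i, v in cube.items()) and predict(x) != target:
            return False
    return True


def minimize(instance, predict, order):
    """Deletion-based minimization that tests features in the given order."""
    n, target = len(instance), predict(instance)
    cube = dict(enumerate(instance))
    for i in order:
        trial = {j: v for j, v in cube.items() if j != i}
        if holds_everywhere(trial, predict, target, n):
            cube = trial
    return cube


def repair(instance, predict, heuristic_features):
    """Test features outside the heuristic explanation first."""
    n = len(instance)
    outside = [i for i in range(n) if i not in heuristic_features]
    inside = [i for i in range(n) if i in heuristic_features]
    return minimize(instance, predict, outside + inside)


def toy_predict(x):  # hypothetical classifier: x0 AND (x1 OR x2)
    return int(x[0] and (x[1] or x[2]))


repaired = repair((1, 1, 1), toy_predict, {1})      # heuristic claims feature 1 suffices
plain = minimize((1, 1, 1), toy_predict, range(3))  # default order, for comparison
```

Here the heuristic feature survives in `repaired`, while the default order discards it; both results are valid and subset-minimal.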
4.4 Refining Heuristic Explanations
Yet another way to use XPlainer is to reduce a given explanation if it is proved by algorithm 2 to be valid. Depending on the requirements of a user, this can be achieved by applying either algorithm 1 or [24, Algorithm 2]. Here, the algorithms should receive the explanation instead of complete . As a result, they will output a subset- or cardinality-minimal explanation s.t. .
Although local explanations computed by heuristic approaches are rarely globally correct, this approach is deemed a promising way to prove the explanations to be minimal or to refine them further. Note that due to the complexity of the abduction-based explanation procedures, minimization of a given explanation may be significantly more efficient than starting from a complete data instance: for computing both subset- and cardinality-minimal explanations.
5 Experiments
This section details the experimental results aiming at the assessment of LIME and Anchor, the state-of-the-art heuristic approaches to explaining black-box models. Following section 4, it focuses on validating, repairing, and refining heuristic explanations.
5.1 Datasets and implementation
The results are obtained on five well-known and publicly available datasets. Three of them (adult, lending, and recidivism) were studied in [40] to illustrate the advantage of Anchor’s explanations over those of LIME. These datasets were processed the same way as in [40] (see https://github.com/marcotcr/anchor-experiments). The adult dataset [28] is originally taken from the Census bureau and targets predicting whether or not a given adult person earns more than $50K a year depending on various attributes. The lending dataset aims at predicting whether or not a loan on the Lending Club website will turn out bad. The recidivism dataset was used to predict recidivism for individuals released from North Carolina prisons in 1978 and 1980 [46]. In addition, two further datasets were considered, compas and german, which were previously studied in the context of the FairML and Algorithmic Fairness projects [15, 18]. Compas is a popular dataset, known [38] for exhibiting racial bias of the COMPAS algorithm used for scoring a criminal defendant’s likelihood of reoffending. The latter dataset is the German credit data (e.g. see [16]), which, given a list of people’s attributes, classifies them as good or bad credit risks.
A prototype of XPlainer (including the proposed encoding of boosted trees and the explanation procedures) is implemented in Python and is available online at https://github.com/alexeyignatiev/xplainer. Extraction of subset- and cardinality-minimal explanations follows algorithm 1 and [24, Algorithm 2], respectively. XPlainer makes use of the SMT solver Z3 [36] as the underlying reasoning engine.
5.2 Results
The performed experiment is detailed below. First, following the standard setup, given a dataset, each XGBoost model was trained on 80% randomly chosen data instances. Each XGBoost model contained 50 trees per class, each tree having depth 3. (Further increasing the number of trees per class and also increasing the maximum depth of a tree does not result in a significant increase of the models’ accuracy on the training and test sets for the considered datasets.) Second, given a dataset and the trained model, an explanation for each of the unique data instances was computed using either LIME or Anchor. (Datasets normally contain duplicate instances, and various predictions can be specified in the dataset for different instantiations of the same input. Once the classifier is trained, it behaves the same way for each of the duplicates; hence, in order to avoid unnecessary repetition, each unique instance was considered once. Also, LIME expects to receive a target explanation size as input, so the experiment bootstrapped LIME with the size of an existing subset-minimal explanation computed by XPlainer.) Third, each explanation was then validated by XPlainer (see subsection 4.2). If an explanation was proved to be incorrect, i.e. optimistic, XPlainer made an attempt to heuristically repair the explanation (see subsection 4.3). Otherwise, XPlainer tried to refine the explanation further (see subsection 4.4). If the refinement succeeded, the explanation was treated as pessimistic. Otherwise, the explanation was reported to be correct and subset-minimal, i.e. realistic from the global perspective.
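The three-way classification used in this experiment can be sketched as follows. This is an illustrative sketch under stated assumptions: validity is checked by brute-force enumeration over binary features instead of XPlainer's SMT-based reasoning, and `holds`, `classify`, and `toy_model` are hypothetical names. Since validity is monotone (any superset of a valid explanation is valid), a valid explanation is subset-minimal exactly when no single feature can be dropped from it.

```python
from itertools import product

def holds(model, x, kept):
    """True iff fixing the features in `kept` forces the prediction of `x`."""
    free = [i for i in range(len(x)) if i not in kept]
    for bits in product((0, 1), repeat=len(free)):
        y = list(x)
        for i, b in zip(free, bits):
            y[i] = b
        if model(tuple(y)) != model(x):
            return False
    return True

def classify(model, x, explanation):
    """Label an explanation as optimistic, pessimistic, or realistic."""
    if not holds(model, x, explanation):
        return "optimistic"       # invalid: a counterexample exists
    if any(holds(model, x, explanation - {i}) for i in explanation):
        return "pessimistic"      # valid, but some feature is redundant
    return "realistic"            # valid and subset-minimal

def toy_model(x):
    # Hypothetical classifier: (x0 AND x1) OR x2
    return int((x[0] and x[1]) or x[2])

x = (1, 1, 0)
labels = [classify(toy_model, x, e) for e in ({0}, {0, 1}, {0, 1, 2})]
print(labels)
```

For this toy model and instance, the explanations {0}, {0, 1}, and {0, 1, 2} are labelled optimistic, realistic, and pessimistic, respectively.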
The results of this experiment are shown in Table 1. Although Anchor is supposed to improve over LIME [40], surprisingly, there is no clear winner between LIME and Anchor; most explanations computed by either approach are inadequate. Observe that for 4 out of 5 datasets the explanations of both LIME and Anchor are mostly optimistic. Concretely, for recidivism and german more than 99% of Anchor’s explanations are optimistic. Similar results hold for LIME, i.e. 94.1% and 85.3% of its explanations for recidivism and german are optimistic. The quality of Anchor’s explanations improves for adult and compas, although more than 80% of them are still optimistic. LIME is ahead of Anchor here, with 61.3% and 71.9% of explanations being optimistic for adult and compas. Surprisingly, the result for the lending dataset does not agree with the rest: only 3% (24%, resp.) of inputs are explained incorrectly by Anchor (LIME, resp.). Overall, 80.5%, 3.0%, 99.4%, 84.4%, and 99.7% of Anchor’s explanations and 61.3%, 24.0%, 94.1%, 71.9%, and 85.3% of LIME’s explanations are optimistic (computed for adult, lending, recidivism, compas, and german, respectively). Also note that the number of pessimistic explanations is significantly lower for Anchor: usually, less than 1.7% of its explanations can be further refined. LIME, however, can produce a significant number of them, e.g. for adult, compas, and german the percentage of pessimistic explanations reaches 7.9%, 20.6%, and 14.6%, respectively.
Explanations for the remaining data instances were proved to be correct and subset-minimal (see column marked by realistic). For Anchor, these comprise 17.9%, 97.0%, 0.2%, 13.9%, 0.1% of inputs for adult, lending, recidivism, compas, and german, respectively. For LIME, the percentage of realistic explanations is 30.8%, 75.6%, 1.3%, 7.5%, and 0.1% for adult, lending, recidivism, compas, and german, respectively.
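As a quick consistency check on the percentages quoted above, the following sketch derives the share of pessimistic explanations implied by the reported optimistic and realistic shares. It assumes that the three categories partition the explained instances; since the reported values are rounded, the derived numbers may be off by about 0.1 percentage points.

```python
datasets = ["adult", "lending", "recidivism", "compas", "german"]

# Percentages as quoted in the text (rounded to one decimal place).
optimistic = {
    "Anchor": [80.5, 3.0, 99.4, 84.4, 99.7],
    "LIME": [61.3, 24.0, 94.1, 71.9, 85.3],
}
realistic = {
    "Anchor": [17.9, 97.0, 0.2, 13.9, 0.1],
    "LIME": [30.8, 75.6, 1.3, 7.5, 0.1],
}

def implied_pessimistic(method):
    """Remaining share once optimistic and realistic explanations are removed."""
    return {
        d: round(100.0 - o - r, 1)
        for d, o, r in zip(datasets, optimistic[method], realistic[method])
    }

for method in ("Anchor", "LIME"):
    print(method, implied_pessimistic(method))
```

For LIME, the derived values for adult, compas, and german agree with the 7.9%, 20.6%, and 14.6% stated in the text.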
To conclude, there are cases when Anchor and LIME behave reasonably well, e.g. for the lending dataset. However, as Table 1 indicates, in most situations the explanations provided by both heuristic approaches are either globally incorrect (optimistic) or can be further refined (pessimistic).
The contribution of LIME, Anchor, and XPlainer (including the validation, repair, and refinement time) to the average total runtime for each data instance is shown in Table 2. Observe that validation time is usually negligible. Also, repairing and refining heuristic explanations in XPlainer’s subset-minimal mode is consistently faster than in the cardinality-minimal mode. This is especially the case for the german dataset, which is the hardest for XPlainer to deal with.
Now, let us compare the size of explanations produced by Anchor and XPlainer. (LIME is not shown here, as LIME’s explanations and the subset-minimal ones are equal in size.) Table 3 details the comparison, showing the minimum, maximum, and mean values, as well as the standard deviation, for the explanations computed by Anchor and for the subset- and cardinality-minimal explanations computed by XPlainer. Here, given an explanation of Anchor, XPlainer was instructed to either repair or refine it. It is not surprising that the mean size of Anchor’s explanations is typically lower because, as was shown above, most of the time Anchor’s explanations are globally optimistic. In general, the average size of Anchor’s explanations varies from 10% to 23% of the total number of features. The average size of subset-minimal explanations is 15–50%, which is still quite good in terms of interpretability. Furthermore, cardinality-minimal explanations improve this result to 15–36% of features on average. However, as the experimental results confirm (see Table 2), computing a cardinality-minimal explanation is computationally more expensive. This represents a reasonable trade-off: depending on the user’s requirements, XPlainer can be applied to compute a subset-minimal explanation (faster but worse quality) or a cardinality-minimal explanation (slower but better quality).
6 Related Work
The importance of providing explanations for predictions made by ML models has grown in recent years, motivated both by ongoing research programs [11] and by recently approved legislation [14, 21]. Nevertheless, the importance of explanations can be traced back to the mid 90s [10]. The computation of explanations can be broadly organized into two main categories, depending on whether the ML model considered is interpretable or not. An ML model is viewed as interpretable if it is amenable to interpretation by a human decision maker. This is the case with decision trees, lists or sets. When considering interpretable ML models, the goal is then to compute models that provide minimal explanations associated with each prediction. A number of works have addressed this topic recently [29, 3, 37, 25]. The work on generating explainable (interpretable) models can be further organized into heuristic approaches (e.g. [29]) and exact solutions [3, 37, 25]. Clearly, a limitation of these approaches is that they are restricted to interpretable ML models, which in many settings are not the preferred choice. An alternative consists of (heuristically) compiling a non-interpretable ML model into another (interpretable) one [19], but an assessment from a (global) quality viewpoint is unavailable. Recent compilation-based approaches for computing global explanations consider Bayesian network classifiers [47], with the drawback of exponential worst-case compilation sizes. For non-interpretable models, one line of work is based on sensitivity analysis [4, 35], on the use of simulated annealing [50], or on the use of case-based reasoning [30]. Recent methods attempt to improve the interpretability of non-interpretable models by analyzing the model after training. Recent work reached conclusions similar to ours with respect to saliency methods [1].
With few exceptions, existing approaches [49, 48, 4, 27, 50, 43, 35, 30, 42, 53, 47, 1, 2] are local in nature, and although some are efficient in practice, computed explanations offer no global guarantees similar to the ones provided by XPlainer. Exact compilation approaches [47] are one such exception, but also exhibit similar (if not worse) concerns in terms of scalability.
7 Conclusions
This paper extends earlier work on computing provably correct explanations by considering the concrete case of boosted trees [9]. The proposed encoding is shown to scale to realistically sized boosted trees, for computing both subset-minimal and cardinality-minimal (correct, global) explanations. In turn, this enabled a first assessment of recently proposed heuristic approaches for computing explanations [39, 40]. On the datasets considered, the results are conclusive and indicate that existing heuristic approaches may be either too optimistic, thus overlooking feature values that are necessary to provide a global explanation of a prediction, or pessimistic, i.e. containing a number of redundant feature values.
A possible downside of the proposed approach is scalability. The NP-hardness of finding subset-minimal explanations (and the hardness for the second level of the polynomial hierarchy of finding cardinality-minimal ones [51]) is likely to limit the applicability of XPlainer. Nevertheless, as the results demonstrate, XPlainer is well-suited to assess the quality of existing and new heuristic approaches on small to medium-scale ML models.
Given the experimental results in this paper, one line of work is to devise more robust heuristic approaches for explaining non-interpretable ML models. Another line of work is to assess other heuristic approaches for explaining ML models, e.g. [4, 31, 30]. Although not a concern for the ML models studied in this paper, a third line of work is to improve the underlying reasoning engine(s) and the proposed encodings, aiming at better scalability of the (provably correct) explanations obtained with XPlainer on more complex ML models. One final line of work is to extend XPlainer to other non-interpretable ML models.
References
- [1] Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I.J., Hardt, M., Kim, B.: Sanity checks for saliency maps. In: NeurIPS. pp. 9525–9536 (2018)
- [2] Alvarez-Melis, D., Jaakkola, T.S.: Towards robust interpretability with self-explaining neural networks. In: NeurIPS. pp. 7786–7795 (2018)
- [3] Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M., Rudin, C.: Learning certifiably optimal rule lists. In: KDD. pp. 35–44 (2017)
- [4] Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.: How to explain individual classification decisions. Journal of Machine Learning Research 11, 1803–1831 (2010)
- [5] Barrett, C., Tinelli, C.: Satisfiability modulo theories. In: Handbook of Model Checking, pp. 305–343 (2018)
- [6] Barrett, C.W., Sebastiani, R., Seshia, S.A., Tinelli, C.: Satisfiability modulo theories. In: Biere, A., Heule, M., van Maaren, H., Walsh, T. (eds.) Handbook of Satisfiability, Frontiers in Artificial Intelligence and Applications, vol. 185, pp. 825–885. IOS Press (2009)
- [7] Boutilier, C., Brafman, R.I., Domshlak, C., Hoos, H.H., Poole, D.: CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements. J. Artif. Intell. Res. 21, 135–191 (2004)
- [8] Chandrasekaran, K., Karp, R.M., Moreno-Centeno, E., Vempala, S.: Algorithms for implicit hitting set problems. In: SODA. pp. 614–629 (2011)
