The Generalizability of Explanations

Hanxiao Tan

arXiv:2302.11965·cs.AI·May 1, 2024

TL;DR

This paper introduces a new evaluation method for explainability techniques based on their generalizability, using an Autoencoder to assess the learnability and plausibility of explanations, and shows smoothing improves this property.

Contribution

It proposes a novel generalizability-based evaluation framework for explainability methods utilizing Autoencoders, addressing the lack of ground truth in explainability evaluation.

Findings

01

Smoothing explanations with SmoothGrad improves their generalizability.

02

The Autoencoder-based approach effectively evaluates multiple explainability methods.

03

The methodology provides a new perspective on explainability evaluation beyond human and sensitivity tests.

Abstract

Due to the absence of ground truth, objective evaluation of explainability methods is an essential research direction. So far, the vast majority of evaluations fall into three categories: human evaluation, sensitivity testing, and sanity checks. This work proposes a novel evaluation methodology from the perspective of generalizability. We employ an Autoencoder to learn the distributions of the generated explanations and observe their learnability as well as the plausibility of the learned distributional features. We first briefly demonstrate the evaluation idea of the proposed approach on LIME, and then quantitatively evaluate multiple popular explainability methods. We also find that smoothing the explanations with SmoothGrad can significantly enhance the generalizability of explanations.

Tables

Table S1: Detailed quantitative results of the proposed evaluation method for LIME with different $n\_sample$. From top to bottom: the averages of Top-k Accuracy, Spearman's and Pearson's coefficients; the Distribution Learnability; the intra- versus inter-class differences in Spearman's coefficient and Fréchet inception distance; the Variance Proximity; and the final generalizability score.

$n\_sample$	10	30	50	100	500
$\overline{TA}$ (%)	$26.7$	$44.9$	$51.4$	$58.2$	$63.7$
$\overline{SC}$	$0.032$	$0.227$	$0.306$	$0.381$	$0.448$
$\overline{PC}$	$0.010$	$0.347$	$0.454$	$0.532$	$0.574$
DL	$0.103$	$0.341$	$0.425$	$0.498$	$0.553$
$\Delta\overline{SC_{a}^{r}}$	$0.019$	$0.120$	$0.125$	$0.111$	$0.124$
$\Delta\overline{FID_{a}^{r}}$	$0.297$	$0.757$	$0.892$	$0.889$	$0.849$
VP	$0.316$	$0.877$	$1.017$	$1.000$	$0.973$
$S_{EM}$	$0.031$	$0.296$	$0.429$	$0.498$	$0.536$

Table S2: Detailed quantitative results of the proposed evaluation method for popular explainability methods. From left to right: Vanilla Gradients (V), Guided Backpropagation (GB), Input $\times$ Gradients (IxG), Integrated Gradients (IG), Layer-wise Relevance Propagation (LRP), DeepLift (DPL), LIME, KernelSHAP (KSHAP) and randomly generated explanations (Random).

Metric	V	GB	IxG	IG	LRP	DPL	LIME	KSHAP	Random
$\overline{TA}$	$0.541$	$0.513$	$0.515$	$0.543$	$0.515$	$0.534$	$0.582$	$0.558$	$0.250$
$\overline{SC}$	$0.727$	$0.402$	$0.232$	$0.254$	$0.234$	$0.249$	$0.381$	$0.362$	$-8e-5$
$\overline{PC}$	$0.596$	$0.499$	$0.403$	$0.441$	$0.409$	$0.449$	$0.532$	$0.496$	$2e-4$
DL	$0.622$	$0.471$	$0.383$	$0.413$	$0.386$	$0.411$	$0.498$	$0.472$	$0.083$
$\Delta\overline{SC_{a}^{r}}$	$0.025$	$0.193$	$0.111$	$0.107$	$0.110$	$0.105$	$0.109$	$0.140$	$0.035$
$\Delta\overline{FID_{a}^{r}}$	$0.364$	$0.508$	$0.401$	$0.378$	$0.418$	$0.386$	$0.854$	$0.876$	$0.204$
VP	$0.389$	$0.701$	$0.512$	$0.485$	$0.528$	$0.491$	$0.963$	$1.016$	$0.239$
$S_{EM}$	$0.241$	$0.330$	$0.196$	$0.200$	$0.203$	$0.201$	$0.479$	$0.479$	$0.019$

Table S3: Detailed quantitative results of the proposed evaluation method for Vanilla Gradients ($V$), Input $\times$ Gradients ($I\times G$), Integrated Gradients ($IG$) and their SmoothGrad versions (subscript $s$).

Metric	$V$	$V_{s}$	$I\times G$	$I\times G_{s}$	$IG$	$IG_{s}$
$\overline{TA}$	$0.541$	$0.629$	$0.515$	$0.488$	$0.543$	$0.508$
$\overline{SC}$	$0.727$	$0.847$	$0.232$	$0.332$	$0.254$	$0.351$
$\overline{PC}$	$0.596$	$0.782$	$0.403$	$0.452$	$0.441$	$0.473$
DL	$0.622$	$0.752$	$0.383$	$0.424$	$0.413$	$0.444$
$\Delta\overline{SC_{a}^{r}}$	$0.025$	$0.030$	$0.111$	$0.299$	$0.107$	$0.304$
$\Delta\overline{FID_{a}^{r}}$	$0.364$	$0.654$	$0.401$	$0.681$	$0.378$	$0.656$
VP	$0.389$	$0.684$	$0.512$	$0.980$	$0.485$	$0.960$
$S_{EM}$	$0.241$	$0.514$	$0.196$	$0.415$	$0.200$	$0.426$

Equations

$$\left\|F(H,x_{i})-F(H,x_{i}+\epsilon)\right\|\leq L\left\|x_{i}-(x_{i}+\epsilon)\right\| \tag{1}$$

$$\rho_{R}=\frac{COV(R(P),R(P^{\prime}))}{\sigma_{R(P)}\sigma_{R(P^{\prime})}} \tag{2}$$

$$TA=\frac{\left|TopK(P)\cap TopK(P^{\prime})\right|}{\left|TopK(P^{\prime})\right|} \tag{3}$$

$$SSIM(P,P^{\prime})=\frac{(2\mu_{P}\mu_{P^{\prime}}+c_{1})(2\sigma_{PP^{\prime}}+c_{2})}{(\mu_{P}^{2}+\mu_{P^{\prime}}^{2}+c_{1})(\sigma_{P}^{2}+\sigma_{P^{\prime}}^{2}+c_{2})} \tag{4}$$

$$d_{F}\big(L_{P}(\mu_{P},\sigma_{P}),L_{P^{\prime}}(\mu_{P^{\prime}},\sigma_{P^{\prime}})\big)=\left\|\mu_{P}-\mu_{P^{\prime}}\right\|_{2}^{2}+tr\Big(\sigma_{P}+\sigma_{P^{\prime}}-2\big(\sigma_{P}^{\frac{1}{2}}\cdot\sigma_{P^{\prime}}\cdot\sigma_{P}^{\frac{1}{2}}\big)^{\frac{1}{2}}\Big) \tag{5}$$

$$VP=\Delta\overline{SC_{a}^{r}}+\Delta\overline{FID_{a}^{r}} \tag{6}$$

$$\Delta\overline{SC_{a}^{r}}=\overline{SC_{a}}-\overline{SC_{r}} \tag{7}$$

$$\Delta\overline{FID_{a}^{r}}=(\overline{FID_{r}}-\overline{FID_{a}})/\overline{FID_{a}} \tag{8}$$

$$S_{EM}=VP\times DL \tag{9}$$

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning and Data Classification

Methods: Local Interpretable Model-Agnostic Explanations

Full text

The Generalizability of Explanations

Hanxiao Tan

AI Group

TU Dortmund

[email protected]

Abstract

Due to the absence of ground truth, objective evaluation of explainability methods is an essential research direction. So far, the vast majority of evaluations fall into three categories: human evaluation, sensitivity testing, and sanity checks. This work proposes a novel evaluation methodology from the perspective of generalizability. We employ an Autoencoder to learn the distributions of the generated explanations and observe their learnability as well as the plausibility of the learned distributional features. We first briefly demonstrate the evaluation idea of the proposed approach on LIME, and then quantitatively evaluate multiple popular explainability methods. We also find that smoothing the explanations with SmoothGrad can significantly enhance the generalizability of explanations.

1 Introduction

As the performance of machine learning models has grown dramatically in recent years, they are increasingly being deployed in a wide range of fields. However, concerns about their trustworthiness are coming to the fore. For instance, in the medical domain, where decisions directly affect human lives, most black-box models cannot shoulder this responsibility. The prevailing solutions are twofold: inherently interpretable models and post-hoc explainability methods. Interpretable models are typically so simple in structure that they are inadequate for decision-making tasks of growing complexity [1]. Although some researchers argue that well-screened features facilitate both interpretability and performance [2], traditional machine learning appears to fall short on image and other high-dimensional visual tasks. Post-hoc explainability approaches place few requirements on the internal architecture of the model and can provide users with convincing explanations while preserving decision performance.

However, the current issue with post-hoc explainability methods is that quantitative evaluation is challenging. Early studies evaluated explanations by collecting a large number of human scores (user studies) [3, 4]. While user studies are human friendly, they are costly, and their results are hard to reproduce due to the subjective nature of human judgment [5]. A viable alternative is quantitative evaluation, i.e. automating the computations to achieve objective assessments. However, the greatest obstacle to objective evaluation is the absence of ground truth explanations. Existing studies have proposed evaluation methods that verify the properties a reasonable explanation should possess [6, 7, 8]. These properties can be divided into two categories: the sensitivity to 1) perturbations of the inputs [6] and 2) randomization of the model parameters (sanity checks) [7]. The former is the most popular candidate; its assumption is that perturbing important features results in a rapid decline in prediction confidence, and vice versa. Nevertheless, skeptics argue that the forced perturbation corrupts the data distribution (producing inputs the model has never learned) and therefore cannot be assessed with the prediction confidence [8]. In addition, it has been argued that the flip-flop perturbation does not take into account the correlations between features [9]. The logic of the sanity check is that a plausible explanation is necessarily model dependent. However, this is not sufficient as an evaluation metric for explanations: most model-related outputs can pass the sanity check, although they clearly do not serve as explanations.

This work proposes a novel approach to evaluate explanations from the perspective of generalizability. Our assumption is that a plausible explanation should share a proximate distribution with the original data. We therefore train a generative model that is adequate for the complexity of the data and observe how well it reconstructs the explanations. Our approach is intuitive, applicable to all saliency map-like explanations, and places no requirements on the type of explainability method (it covers both gradient- and perturbation-based methods). Our contributions are as follows:

• We propose a novel method for evaluating explanations that validates whether they possess information consistent with the original data from the perspective of generalizability.

• We evaluate popular saliency map-based explainability methods.

• We show the interesting observation that SmoothGrad can effectively enhance the generalizability of explanations.

The structure of this paper is as follows: In Section 2 we present the relevant studies, in Section 3 we detail the proposed approach, and in Section 4 we show the experimental results. Finally, we conclude and describe future work in Section 5.

2 Related Work

In this section, we introduce popular explainability methods, as well as existing approaches for evaluating explanations.

Explainability Methods: Prevailing explainability methods are broadly split into two categories, i.e., gradient-based and perturbation-based. Gradient-based explanations originate from Vanilla Gradients [10] and indicate importance by calculating the gradient of an output neuron with respect to the input. Improved variants have been proposed to address the flaws of the original method. For example, [11] argued that Vanilla Gradients suffers from gradient saturation, which motivated Integrated Gradients [12], a method that integrates the gradients along a path starting from a selected uninformative baseline. [13] smooths the appearance of the explanations by introducing Gaussian noise to adjacent pixels. [6] proposed Layer-wise Relevance Propagation (LRP), which follows specific rules to propagate importance backwards from the output layer to the input. In addition, some studies employ tricks for visually optimizing the explanations, such as [14, 15]. Perturbation-based approaches mostly utilize surrogate models that substitute the original black-box model with an interpretable linear one, and sample training data locally in the neighborhood of the instance to be explained so that the performance of the surrogate model is similar to that of the original one. Representatives include LIME [16] and KernelSHAP [17], which differ in sampling weights and perturbation patterns. Subsequently, numerous variants have been proposed to refine the quality of the explanations, such as [18] and [5]. In addition, instance-based explainability methods are widely adopted as well, such as counterfactuals [19] and activation maximization [20].

Evaluation Approaches: There are two existing mainstream evaluation methods: sensitivity to input perturbations and sensitivity to model parameters [21]. The former operates by perturbing the features of the input and observing the changes in the prediction. For a plausible explanation, perturbations to the feature with the largest attribution lead to a severe drop in the prediction confidence. This approach is intuitive and therefore widely applied to justify the reliability of explanations [6, 22, 23, 24, 25, 26]. However, critics have argued that such perturbations neglect the correlation between features (pixels) or feed the model a distribution it has never learned, which impairs the reliability of the evaluation [27, 8]. [8] proposes RemOve And Retrain (ROAR), which retrains the model with the perturbed dataset and observes the magnitude of the performance degradation, to some extent eliminating the out-of-distribution issue of the perturbed data.

The sensitivity of the model parameters originates from a concern about the sanity of the explanations. [28, 7] reveal that by randomizing the parameters of the model, several explainability methods remain unaffected, which raises questions about whether the explanations are faithful to the model.

Besides, several other approaches have been proposed, such as the Pointing Game [29], which counts the number of times the point with the largest attribution in the explanation falls inside the target object in the image. User studies are also an important methodology for assessing explanations. Although explanations need to be human-friendly, this method is costly and subjective, and may not reveal the true basis of the prediction [5]. Moreover, several studies raise concerns about the robustness or stability of explainability methods [22, 30, 23]. Although they experimentally demonstrate certain deficiencies of explanations, such as the lack of linear invariance [23] and Lipschitz continuity [22], they cannot be used as quantitative evaluation approaches.

3 Methods

In this section we present a novel approach to evaluating explanations. Unlike existing intuitive perturbation-based sensitivity measures, this method targets the generalizability of the explanations across the entire data set.

Consider an image dataset $X=\left\{x_{1},...,x_{n}\right\}\subseteq R^{w\times h}$ with a well-trained model $H(\cdot)$ . We obtain the explanation set $P=\left\{p_{1},...,p_{n}\right\}\subseteq R^{w\times h}$ by the explainability method $F$ . The explainability approach can be regarded as a mapping function, i.e. $p_{i}=F(H,x_{i})$ (only local methods are considered here). Therefore, there should be a certain degree of distributional similarity between $X$ and $P$ : $X\mathrel{\dot{\sim}}P$ . Intuitively, this can be understood as "proximity", i.e., similar inputs should yield similar explanations and vice versa. Existing studies have raised similar concerns: they perturb the input and measure the local continuity of the explanations with Local Lipschitz Continuity, which is formulated as:

$$\left\|F(H,x_{i})-F(H,x_{i}+\epsilon)\right\|\leq L\left\|x_{i}-(x_{i}+\epsilon)\right\| \tag{1}$$

where $\epsilon$ is the perturbation matrix and $L$ is a constant. However, we argue that the perturbations may disrupt the data distribution. As with sensitivity tests [8], out-of-distribution data that the model has never seen may impair the evaluation performance.
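As an illustration of the local Lipschitz criterion discussed above, the following minimal sketch estimates the constant empirically as the largest observed ratio over random perturbations. Everything in it is an assumption made for illustration: `local_lipschitz` and `toy_explain` are hypothetical names, the Gaussian perturbation scale is arbitrary, and the toy explanation (the input gradient of $\tanh(w\cdot x)$) merely stands in for $F(H,\cdot)$.

```python
import numpy as np

def local_lipschitz(explain, x, eps_scale=0.01, n_trials=100, seed=0):
    """Largest observed ratio ||F(x) - F(x + eps)|| / ||eps|| over random perturbations."""
    rng = np.random.default_rng(seed)
    base = explain(x)
    ratios = []
    for _ in range(n_trials):
        eps = rng.normal(scale=eps_scale, size=x.shape)
        ratios.append(np.linalg.norm(explain(x + eps) - base) / np.linalg.norm(eps))
    return max(ratios)

# Toy stand-in for F(H, x): the input gradient of the scalar model tanh(w . x).
w = np.array([1.0, -2.0, 0.5])

def toy_explain(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w

L_hat = local_lipschitz(toy_explain, np.array([0.1, 0.2, -0.3]))
```

The estimate is only a lower bound on the true constant, and it inherits the concern raised in the text: the perturbed inputs `x + eps` may lie outside the data distribution.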

Our approach involves the entire process of explanation generation. For a well-trained model $H$ , a certain rule $R_{m}$ should exist with regard to the prediction. For instance, when multiple horse images are input, the model relies on similar features as the basis for its predictions (e.g., legs or tails). A plausible explainability method $F$ should exhibit this rule in its explanations. However, $R_{m}$ is unknown due to the opaqueness of the black-box model. Therefore, under the premise that the model is well-trained, we can verify whether $F$ is reliable by verifying $R_{m}$ . Inspired by earlier research showing that a network with sufficient parameters can fit any function [31, 32], we train an appropriately structured generative neural network $G$ to simulate $R_{m}$ . $G$ takes the original images $x_{i}$ as input and outputs the corresponding explanation $p^{\prime}_{i}$ ( $p^{\prime}_{i}\approx p_{i}=F(H,x_{i})$ ), and the reliability of the explainability method $F$ is evaluated by observing the performance of $G(X)$ on the whole dataset. The evaluation via $G$ is based on two factors:

• Generalizability: $F$ should not be erratic. If the trained $G$ performs poorly, this indicates that $F$ is unlearnable and thus erratic. For example, $G$ cannot learn any rule from randomly distributed saliency maps, which leads to a loss curve that barely decreases. Note that the converse does not hold. One trick is to generate simple and identical explanations for all inputs, which can be easily learned by the network but cannot be considered reliable explanations.

• Distributional proximity: As a complement to the previous point, $P^{\prime}$ should possess statistical properties similar to those of $X$ . Intuitively, the gap between explanations in the same class should be much smaller than that between different classes. This complement prevents the aforementioned "simple and identical" explanations from being regarded as plausible, since they are not consistent with the distribution of the original images.

Subsequently, we elaborate on the details of the proposed evaluation approach. The general structure of the approach can be seen in Fig. 1. Our evaluation method consists of two components, which verify generalizability and proximity, respectively.

Distribution Learnability: The essential task of the simulating model is to learn the rules of the explainability method. Therefore, the model requires sufficient learning capability (i.e., structural complexity). We try Autoencoders with different architectures until one can reconstruct the input image with high quality ( $L_{1}$ loss $\approx 0.01$ , $L_{2}$ loss $\approx 0.001$ and Structural Similarity Index Measure (SSIM) $>0.99$ ; the detailed architecture can be seen in Fig. S1 and the qualitative results are shown in Fig. S2). We train this Autoencoder with the original images as inputs and the explanations as labels, with $L_{1}$ as the target loss function, until the loss curve converges. We denote this Autoencoder as $AE_{i}$ for easy reuse in subsequent processes. We divide the dataset $X$ into training ( $X_{tr}$ ) and test sets ( $X_{te}$ ), train the Autoencoder corresponding to each explainability method (denoted as $AE_{F}$ , where $F$ represents a certain explainability method) with $X_{tr}$ according to the framework in Fig. S1, and then input $X_{te}$ into $AE_{F}$ to obtain the reconstructed explanations $P^{\prime}_{te}$ . The plausibility of the explanations can be observed from the reconstruction performance: the Autoencoder is more likely to reconstruct explanations that follow more typical rules, whereas erratic and random explanations cannot be learned well. We show an example of the generalizability evaluation in Section 4.1.
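The learnability idea can be illustrated on toy data. The sketch below is not the Autoencoder setup of the paper: it deliberately swaps in a linear least-squares map as the learner, and the data, the function name `heldout_pearson` and all sizes are hypothetical. It only shows the qualitative effect described above: explanations that depend on the input can be predicted on held-out samples, while random ones cannot.

```python
import numpy as np

def heldout_pearson(X, P, n_train):
    """Fit a linear map X -> P on the training split; return Pearson r on the test split."""
    X_tr, X_te, P_tr, P_te = X[:n_train], X[n_train:], P[:n_train], P[n_train:]
    A_tr = np.hstack([X_tr, np.ones((len(X_tr), 1))])   # add a bias column
    A_te = np.hstack([X_te, np.ones((len(X_te), 1))])
    W, *_ = np.linalg.lstsq(A_tr, P_tr, rcond=None)
    P_hat = A_te @ W
    return np.corrcoef(P_hat.ravel(), P_te.ravel())[0, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(700, 20))                             # toy flattened "images"
M = rng.normal(size=(20, 20))
P_structured = X @ M + 0.1 * rng.normal(size=(700, 20))   # explanations that follow a rule
P_random = rng.normal(size=(700, 20))                      # explanations that ignore the input

r_structured = heldout_pearson(X, P_structured, n_train=500)
r_random = heldout_pearson(X, P_random, n_train=500)
```

On this toy data the held-out correlation is close to 1 for the structured explanations and close to 0 for the random ones, mirroring the behavior the text expects from a sufficiently expressive learner.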

To quantitatively assess the learnability of the explanations, we calculate the reconstruction performance of $P^{\prime}_{te}$ by six different measurements:

L1 & L2: The L1 and L2 distances provide an intuitive indication of pixel-wise similarity; however, they may be affected by extreme values, and points with little attribution carry the same weight as important ones.

Pearson & Spearman’s rank correlation coefficient: The Pearson correlation coefficient (PC) measures the linear correlation of two explanations, which is formulated as $\rho=\frac{COV(P,P^{\prime})}{\sigma_{P}\sigma_{P^{\prime}}}$ . When comparing explanations, one may focus more on the ranking of feature attributions than on the specific values. Spearman’s correlation coefficient (SC) is the ranked version of the Pearson correlation coefficient, which can be formulated as:

$$\rho_{R}=\frac{COV(R(P),R(P^{\prime}))}{\sigma_{R(P)}\sigma_{R(P^{\prime})}} \tag{2}$$

where $R(*)$ is the rank function, COV is the covariance and $\sigma$ is the standard deviation.
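A minimal NumPy sketch of the two coefficients, assuming explanations are given as arrays that can be flattened. The helper names are hypothetical, and ties are broken by position rather than averaged, which is a simplification.

```python
import numpy as np

def pearson(p, q):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    p, q = np.ravel(p).astype(float), np.ravel(q).astype(float)
    pc, qc = p - p.mean(), q - q.mean()
    return float((pc * qc).sum() / np.sqrt((pc ** 2).sum() * (qc ** 2).sum()))

def ranks(v):
    """Ordinal ranks 0..n-1 (ties are broken by position, not averaged)."""
    order = np.argsort(v, kind="stable")
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(len(v))
    return r

def spearman(p, q):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(np.ravel(p)), ranks(np.ravel(q)))

p = np.array([0.1, 0.4, 0.2, 0.9])
sc = spearman(p, p ** 3)   # monotone transform keeps the ranking
pc = pearson(p, p ** 3)    # but the relation is no longer linear
```

Here `sc` equals 1 while `pc` is below 1, which is why the rank-based coefficient suits comparisons where only the ordering of attributions matters. For saliency maps with many equal values (e.g. zero background pixels), a tie-aware routine such as `scipy.stats.spearmanr` would be the safer choice.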

Top-k Accuracy (TA): For explanations, humans tend to be concerned only with the features that yield larger attributions. Therefore, we perform a Pointing Game on the pixels with the top-K attributions. Specifying a percentage K ( $25\%$ in all experiments), we calculate how many of the pixels with top-K attributions in $P_{te}$ also appear among the pixels with top-K attributions in $P^{\prime}_{te}$ . It can be formulated as:

$$TA=\frac{\left|TopK(P)\cap TopK(P^{\prime})\right|}{\left|TopK(P^{\prime})\right|} \tag{3}$$

TA is in the range $[0,1]$ , with higher values indicating that the pixels with large attributions are more similar in the two explanations. Note that the TA of two sufficiently large random sequences converges to $K$ (0.25 in this experiment).
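The Top-k Accuracy can be sketched as a set overlap. This is an illustrative implementation under the assumption that K is given as a fraction of pixels; `topk_accuracy` is a hypothetical helper name, and the random maps serve only to show the chance level.

```python
import numpy as np

def topk_accuracy(p, p_rec, k=0.25):
    """|TopK(P) ∩ TopK(P')| / |TopK(P')| with K given as a fraction of pixels."""
    p, p_rec = np.ravel(p), np.ravel(p_rec)
    n_top = max(1, int(round(k * p.size)))
    top_p = set(np.argsort(p)[-n_top:].tolist())
    top_rec = set(np.argsort(p_rec)[-n_top:].tolist())
    return len(top_p & top_rec) / len(top_rec)

rng = np.random.default_rng(0)
a, b = rng.random(100 * 100), rng.random(100 * 100)
ta_same = topk_accuracy(a, a)   # identical maps
ta_rand = topk_accuracy(a, b)   # independent random maps
```

For identical maps the score is 1, and for two independent random maps it lands near $K=0.25$, the chance level noted in the text.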

Structural Similarity Index Measure (SSIM): In particular, for images, we introduce SSIM as a similarity measure, which compares the statistics (mean and variance) of two explanations in terms of luminance, contrast and structure. SSIM is formulated as:

$$SSIM(P,P^{\prime})=\frac{(2\mu_{P}\mu_{P^{\prime}}+c_{1})(2\sigma_{PP^{\prime}}+c_{2})}{(\mu_{P}^{2}+\mu_{P^{\prime}}^{2}+c_{1})(\sigma_{P}^{2}+\sigma_{P^{\prime}}^{2}+c_{2})} \tag{4}$$

where $\mu$ and $\sigma$ denote the mean and (co)variance, respectively. Note that the core of SSIM is the approximation of these statistics. Since existing studies show that it is problematic for assessing perceptual proximity [33], we only consider it as a reference.
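The SSIM formula can be evaluated directly from whole-image statistics, as in the sketch below. Two assumptions are made that the text does not state: the stabilizing constants are set to the commonly used $c_{1}=(0.01\,R)^{2}$ and $c_{2}=(0.03\,R)^{2}$ for data range $R$, and the statistics are computed globally rather than in local sliding windows as typical library implementations do.

```python
import numpy as np

def global_ssim(p, q, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed from whole-image statistics (no sliding window)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2   # assumed constants
    mu_p, mu_q = p.mean(), q.mean()
    var_p, var_q = p.var(), q.var()
    cov = ((p - mu_p) * (q - mu_q)).mean()
    return ((2 * mu_p * mu_q + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_q ** 2 + c1) * (var_p + var_q + c2))

rng = np.random.default_rng(0)
img = rng.random((28, 28))
ssim_same = global_ssim(img, img)            # identical inputs
ssim_inverted = global_ssim(img, 1.0 - img)  # inverted map, negative covariance
```

Identical inputs give a value of 1, while the inverted map scores far lower because its covariance with the original is negative.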

Variance Proximity: Good generalizability is not sufficient to conclude the plausibility of the explanations. Simple rules can be easily learned by the Autoencoder (e.g. by focusing the attribution on a fixed pixel); however, they may not provide reliable explanations. Therefore, we exclude these uninformative explanations by evaluating the proximity of the distributions between the reconstructed explanations and the original images. Our assessment is based on the idea that the discrepancies between reconstructed explanations of the same class should be much smaller than those between different classes. In this regard we consider two proximity measurements, the pixel-wise ranking discrepancy and the latent distance. For the former, comparing absolute values is not meaningful, since the reconstructed explanations may not be in the same order of magnitude as the original image. Therefore, we again employ Spearman’s rank correlation coefficient to measure pixel-wise proximity. For the latter, we construct a latent distance measurement to cross-validate the pixel-wise approach, which can be considered a variant of the Fréchet inception distance (FID). We take the encoder of the original Autoencoder ( $AE_{i}$ ) as the "Inception" network, input the images and explanations respectively to obtain the latent vectors, and compute their 2-Wasserstein distances (both are considered as multidimensional Gaussian distributions):

$$d_{F}\big(L_{P}(\mu_{P},\sigma_{P}),L_{P^{\prime}}(\mu_{P^{\prime}},\sigma_{P^{\prime}})\big)=\left\|\mu_{P}-\mu_{P^{\prime}}\right\|_{2}^{2}+tr\Big(\sigma_{P}+\sigma_{P^{\prime}}-2\big(\sigma_{P}^{\frac{1}{2}}\cdot\sigma_{P^{\prime}}\cdot\sigma_{P}^{\frac{1}{2}}\big)^{\frac{1}{2}}\Big) \tag{5}$$
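The latent distance described above can be sketched with NumPy alone. The code assumes the latent vectors are already available as rows of two arrays (producing them with the encoder of $AE_{i}$ is outside this sketch), and the function names are hypothetical. The matrix square roots are taken by eigendecomposition, which is valid because every matrix involved is symmetric positive semi-definite.

```python
import numpy as np

def psd_sqrt(m):
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh((m + m.T) / 2)
    return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def frechet_distance(z_p, z_q):
    """Fréchet distance between Gaussians fitted to two sets of latent vectors (rows)."""
    mu_p, mu_q = z_p.mean(axis=0), z_q.mean(axis=0)
    cov_p, cov_q = np.cov(z_p, rowvar=False), np.cov(z_q, rowvar=False)
    root_p = psd_sqrt(cov_p)
    cross = psd_sqrt(root_p @ cov_q @ root_p)
    return float(np.sum((mu_p - mu_q) ** 2) + np.trace(cov_p + cov_q - 2 * cross))

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 8))
d_same = frechet_distance(z, z)         # identical sets
d_shift = frechet_distance(z, z + 3.0)  # same covariance, means shifted by 3 in 8 dimensions
```

With identical sets the distance vanishes, and a pure mean shift contributes only the squared-norm term, here $8\times 3^{2}=72$.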

In measuring the reconstructed similarity within classes, we take a certain number of samples ( $S$ ) from each class for a pairwise comparison. For inter-class, we try all combinations of different classes and take $S$ samples from each class for a pairwise comparison. For quantitative comparison, our final Variance Proximity scores (VP) can be calculated as:

$$VP=\Delta\overline{SC_{a}^{r}}+\Delta\overline{FID_{a}^{r}} \tag{6}$$

where $\Delta\overline{SC_{a}^{r}}$ is the difference between the means of the intra- and inter-class Spearman coefficients, which can be formulated as:

$$\Delta\overline{SC_{a}^{r}}=\overline{SC_{a}}-\overline{SC_{r}} \tag{7}$$

and $\Delta\overline{FID_{a}^{r}}$ denotes the (normalized) difference between the averages of the inter- and intra-class FIDs, which can be expressed as:

$$\Delta\overline{FID_{a}^{r}}=(\overline{FID_{r}}-\overline{FID_{a}})/\overline{FID_{a}} \tag{8}$$
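The intra- versus inter-class comparison behind $\Delta\overline{SC_{a}^{r}}$ can be sketched as follows. The sampling scheme (drawing $S$ explanations per class and comparing all pairs) follows the description in the text, while the helper names, the toy data and the position-based tie handling in the ranks are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def rank_rows(a):
    """Ordinal rank of every entry within its row (ties broken by position)."""
    order = np.argsort(a, axis=1, kind="stable")
    return np.argsort(order, axis=1, kind="stable").astype(float)

def mean_pairwise_corr(r1, r2, same):
    """Mean Pearson correlation between rows of r1 and r2 (on ranks this is Spearman)."""
    z1 = (r1 - r1.mean(1, keepdims=True)) / r1.std(1, keepdims=True)
    z2 = (r2 - r2.mean(1, keepdims=True)) / r2.std(1, keepdims=True)
    c = z1 @ z2.T / r1.shape[1]
    if same:  # drop self-pairs on the diagonal
        return c[~np.eye(len(c), dtype=bool)].mean()
    return c.mean()

def delta_sc(p_rec, labels, s=50, seed=0):
    """Mean intra-class Spearman minus mean inter-class Spearman, S samples per class."""
    rng = np.random.default_rng(seed)
    samples = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        pick = rng.choice(idx, size=min(s, len(idx)), replace=False)
        samples[c] = rank_rows(p_rec[pick])
    intra = np.mean([mean_pairwise_corr(r, r, True) for r in samples.values()])
    inter = np.mean([mean_pairwise_corr(samples[a], samples[b], False)
                     for a, b in combinations(samples, 2)])
    return float(intra - inter)

# Toy data: two classes with opposite attribution patterns plus small noise.
t = np.linspace(0.0, 1.0, 30)
templates = np.stack([t, t[::-1]])
labels = np.repeat([0, 1], 40)
rng = np.random.default_rng(1)
p_rec = templates[labels] + 0.05 * rng.normal(size=(80, 30))

gap_informative = delta_sc(p_rec, labels, s=20)
gap_constant = delta_sc(np.tile(t, (80, 1)), labels, s=20)  # identical map for every input
```

The class-dependent maps yield a large positive gap, whereas the identical maps yield a gap of zero. This is the mechanism by which Variance Proximity penalizes "simple and identical" explanations even though they are easy to learn.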

Finally, the quantitative score of corresponding explainability methods is:

$$S_{EM}=VP\times DL \tag{9}$$

where VP is the variance proximity (see equation 6), and DL denotes the Distribution Learnability, which is the average of Top-k Accuracy (TA), Spearman’s Correlation (SC) and Pearson Correlation (PC) coefficients.
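As a worked check of how the final score is assembled, the sketch below recomputes DL, VP and $S_{EM}$ from the component values reported in Table S1 for LIME with $n\_sample=100$, with $\overline{TA}$ expressed as a fraction (0.582) rather than a percentage. The function name is hypothetical.

```python
def generalizability_score(ta, sc, pc, delta_sc, delta_fid):
    """Return (DL, VP, S_EM) from the averaged metrics."""
    dl = (ta + sc + pc) / 3.0    # Distribution Learnability: mean of TA, SC and PC
    vp = delta_sc + delta_fid    # Variance Proximity
    return dl, vp, vp * dl       # final score S_EM = VP * DL

# Component values from Table S1, column n_sample = 100.
dl, vp, s_em = generalizability_score(
    ta=0.582, sc=0.381, pc=0.532, delta_sc=0.111, delta_fid=0.889)
```

Rounded to three decimals this gives DL = 0.498, VP = 1.000 and $S_{EM}$ = 0.498, matching the corresponding column of Table S1.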

4 Experiments

In this section, we first show an example comparison on evaluating LIME (Section 4.1), followed by a quantitative evaluation of popular explainability methods with the proposed approach. We choose the MNIST handwritten digit dataset for the experiments. The accuracy of the classification model to be explained is $98.5\%$ on the test set. All Autoencoders share an identical structure, with a latent vector of dimension 128. We train each for $100$ epochs with a learning rate of $1e-5$ . The Autoencoder that reconstructs the original image achieves an $L_{1}$ loss $\approx 0.01$ , an MSE loss $\approx 0.001$ and $SSIM>0.99$ . In the evaluation of Variance Proximity, we choose $S=500$ .

4.1 Case study: The Generalizability of LIME

To intuitively demonstrate the correlation between the generalizability and the quality of explanations, we first show a straightforward evaluation example. We choose an explainability approach whose parameters can be tuned to generate explanations of different quality. LIME [16] is an ideal candidate because it relies on perturbing the instances to be explained to probe local decision boundaries, and the number of perturbed samples affects the quality of the explanation. It is already known that LIME performs poorly with few perturbed samples and improves as their number increases, until saturation. We segment the input into 50 super-pixels, choose LIME with a total number of perturbed samples ( $n\_sample$ ) of 10, 30, 50, 100 and 500 for explaining the classification model, and evaluate the generalizability of the explanations according to the proposed method.

The results, as illustrated in Fig. 2, indicate that an insufficient number of perturbation samples may prevent the Autoencoder from learning the distribution of the explanations. For instance, during training, although the loss values (L1, MSE and SSIM) of LIME10 ( $n\_sample=10$ ) are at a low level ( $0.165\pm 1e-4$ , $0.254\pm 1e-5$ and $0.271\pm 0.004$ , respectively, computed over the last $10$ epochs), the distribution is barely learned: the top-k pixel accuracy ( $\overline{TA}$ ) converges to $25\%$ ( $26\pm 3.6\%$ ), which is close to that of a random distribution, and both Spearman and Pearson coefficients (SC and PC) oscillate near zero ( $0.021\pm 0.054$ and $0.020\pm 0.056$ , respectively), implying that there is hardly any (ranked) linear correlation between the learned distribution and the original explanations. As $n\_sample$ increases, the distribution learned by the Autoencoder from the explanations grows more accurate. When $n\_sample$ increases to $30$ , the top-k accuracy, Spearman and Pearson coefficients of the generated samples converge to $44.9\pm 0.1\%$ , $0.227\pm 0.003$ and $0.347\pm 0.002$ , respectively. As $n\_sample$ further grows to $50$ , these metrics rise to $51.4\pm 0.1\%$ , $0.306\pm 0.002$ and $0.454\pm 0.003$ , respectively. However, the benefits from a continued increase in $n\_sample$ are limited ( $\overline{TA}$ , $\overline{SC}$ and $\overline{PC}$ converge to $63.7\pm 0.1\%$ , $0.448\pm 0.001$ and $0.574\pm 0.002$ , respectively, when $n\_sample=500$ ). We report the final DL scores of the LIME explanations for $n\_sample$ of 10, 30, 50, 100 and 500 as $0.103$ , $0.341$ , $0.425$ , $0.498$ and $0.553$ , respectively.

To avoid the trap of "simple rules", we also evaluate the variances and proximities between the generated samples. Fig. 3 depicts the point-wise ranking (left) and latent space (right) proximity of the generated samples for inter- and intra-class pairs. We observe that as the number of perturbed samples grows, the Spearman coefficients (SC) increase to varying degrees depending on the class relationship. When the number of perturbation samples is minimal ( $n\_sample=10$ ), the intra- and inter-class similarities are nearly equal. The averages of SC for inter- and intra-class pairs are $\overline{SC_{r}}=0.46$ and $\overline{SC_{a}}=0.49$ , respectively, whose difference is $\Delta{\left|\overline{SC_{a}^{r}}\right|}=0.02$ . Similarly, the means of FID for inter- and intra-class pairs are $\overline{FID_{r}}=1.4e-5$ and $\overline{FID_{a}}=1.8e-5$ , respectively, and the corresponding difference is $\Delta{\left|\overline{FID_{a}^{r}}\right|}=3.7e-6$ . According to the boxplot, the gaps between inter- and intra-class pairs increase markedly with the growth of $n\_sample$ . According to equation 6, we derive the VP scores of LIME with different $n\_sample$ as 0.316, 0.877, 1.017, 1.000 and 0.973, respectively.

We finally report the quantitative scores of LIME with 10, 30, 50, 100 and 500 perturbed samples as 0.031, 0.296, 0.429, 0.498, 0.536. The detailed tabulated results are shown in Table S1. For intuitive understanding, we present the explanations generated by LIME with different $n\_sample$ in Fig. S3.

4.2 The Generalizability of Explainability Methods

A further step is to extend the proposed evaluation approach to more explainability methods. We choose several popular gradient-based and perturbation-based methods, including Vanilla Gradients [10], Guided Back-propagation [14], Input $\times$ Gradients [34], Integrated Gradients [12], Layer-wise Relevance Propagation [6], DeepLift [34], LIME ( $n\_sample=100$ ) [16] and KernelSHAP [17]. As a reference, we additionally introduce a randomly generated noise explanation. We again make predictions and explanations on MNIST for all test sets and reconstruct the explanations with an Autoencoder of the same architecture.

The training results are illustrated in Fig. 4. In terms of loss (top row), all methods (except random) reduce the distance and statistical gap (L1, MSE and SSIM), which implies that all explainability methods possess a degree of generalizability. This is also evident from the perspective of the proximity of the distribution (bottom row): all of the explainability methods far outperform the random explanation in the three metrics, namely Top-K accuracy, Spearman and Pearson coefficients. Among these approaches, Vanilla Gradients ( $\overline{TA}=0.541\pm 2e-4$ , $\overline{SC}=0.727\pm 2e-4$ and $\overline{PC}=0.596\pm 4e-4$ , calculated by averaging the last 10 epochs), LIME ( $\overline{TA}=0.582\pm 5e-4$ , $\overline{SC}=0.381\pm 1e-3$ and $\overline{PC}=0.532\pm 8e-4$ ) and KernelSHAP ( $\overline{TA}=0.558\pm 4e-4$ , $\overline{SC}=0.362\pm 8e-4$ and $\overline{PC}=0.496\pm 1e-3$ ) perform slightly better in terms of learnability of the distribution. Comparatively, IxG and LRP are less learnable, with all three metrics being relatively inferior ( $\overline{TA}=0.515\pm 8e-4$ , $\overline{SC}=0.232\pm 8e-4$ , $\overline{PC}=0.403\pm 1e-3$ and $\overline{TA}=0.515\pm 7e-4$ , $\overline{SC}=0.234\pm 9e-4$ , $\overline{PC}=0.409\pm 1e-3$ , respectively). Furthermore, we observe that the generalizability of perturbation-based methods ( $\overline{TA_{overall}}=0.570$ , $\overline{SC_{overall}}=0.371$ and $\overline{PC_{overall}}=0.514$ ) is superior to that of the majority of gradient-based methods.

Again, we assess the Variance Proximity between the generated explanations. As demonstrated in Fig. 5, the reconstructed explanations of all explainability methods outperform the random ones, which indicates that their explanations are learnable in terms of intra-class and inter-class discrepancies. However, Vanilla Gradients ( $\Delta\overline{SC_{a}^{r}}=0.025$ , $\Delta\overline{FID_{a}^{r}}=0.364$ and $VP=0.389$ ), Integrated Gradients ( $\Delta\overline{SC_{a}^{r}}=0.105$ , $\Delta\overline{FID_{a}^{r}}=0.386$ and $VP=0.491$ ) and DeepLift ( $\Delta\overline{SC_{a}^{r}}=0.107$ , $\Delta\overline{FID_{a}^{r}}=0.378$ and $VP=0.485$ ) are relatively inferior. Considering the excellent performance of Vanilla Gradients in Distribution Learnability ( $DL=0.622$ ), we believe that its explanation distributions are relatively homogeneous, in line with the aforementioned "simple rule" trap. IG and DeepLift achieve only mediocre Distribution Learnability ( $DL=0.413$ and $0.411$ , respectively), and we therefore consider their explanations slightly noisy, which interferes with the learning of the Autoencoder. In addition, we observe that the perturbation-based approaches perform better in the assessment of Variance Proximity ( $VP=0.963$ for LIME and $1.016$ for KernelSHAP). This is mainly attributed to the randomness and complexity of the perturbation-based explanation generation process, which prevents the Autoencoder from summarizing fixed and simplistic rules.

In conclusion, we report that, in comparison to gradient-based methods ( $S_{EM}$ for $V$ , $GB$ , $I\times G$ , $IG$ , $LRP$ and $DPL$ are $0.241$ , $0.330$ , $0.196$ , $0.200$ , $0.203$ and $0.201$ , respectively), perturbation-based ones show better performance ( $S_{EM}$ for $LIME$ and $KSHAP$ are both $0.479$ ), since the distributions of their generated explanations are more generalizable. For a clearer comparison, tabulated results for all metrics and the final quantitative scores are presented in Table S2. A comparison of several visualizations of the explanations is shown in Fig. S4.
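As a quick arithmetic cross-check of the figures quoted in this section, the sketch below tests two relations that the numbers appear to satisfy: that $VP$ equals the sum of the two reported differences, and that the final score $S_{EM}$ is close to $DL \times VP$. Both relations are inferred here from the quoted values and are assumptions of this sketch; the metric definitions given earlier in the paper take precedence. The names `REPORTED` and `consistency_gaps` are hypothetical.

```python
# Figures quoted in Section 4.2: (delta_SC, delta_FID, VP, DL, S_EM).
REPORTED = {
    "Vanilla Gradients":    (0.025, 0.364, 0.389, 0.622, 0.241),
    "Integrated Gradients": (0.105, 0.386, 0.491, 0.413, 0.200),
    "DeepLift":             (0.107, 0.378, 0.485, 0.411, 0.201),
}

def consistency_gaps(reported):
    """Return {method: (|VP - (dSC + dFID)|, |S_EM - DL * VP|)}.

    Both relations are assumptions inferred from the quoted numbers.
    """
    gaps = {}
    for name, (d_sc, d_fid, vp, dl, score) in reported.items():
        gaps[name] = (abs(vp - (d_sc + d_fid)), abs(score - dl * vp))
    return gaps

for name, (vp_gap, score_gap) in consistency_gaps(REPORTED).items():
    print(f"{name:22s} VP gap = {vp_gap:.3f}   score gap = {score_gap:.3f}")
```

For these three methods the additive relation holds to three decimals, while the product matches the reported score only approximately (within about 0.003), which may reflect rounding or averaging in the reported values.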

4.3 Smoothed vs. Unsmoothed Maps

SmoothGrad [13] is a technique that adds noise to the original image multiple times, computes the gradient (or another explanation) for each noisy copy, and averages the results. SmoothGrad-processed explanations are considered cleaner and more comprehensible. Interestingly, we found that SmoothGrad not only enhances visual clarity, but also increases the generalizability of the explanations. We choose three gradient-based methods as baselines, namely Vanilla Gradients (V), Input $\times$ Gradients (IxG) and Integrated Gradients (IG), and apply SmoothGrad to each of them.
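The averaging step can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the toy model (a `tanh` of a linear score with an analytic gradient), the function names, and the default values of `n_samples` and `sigma` are all assumptions made for the example.

```python
import numpy as np

def toy_gradient(x, w):
    """Analytic gradient of the toy score tanh(w . x) with respect to x."""
    return (1.0 - np.tanh(w @ x) ** 2) * w

def smoothgrad(grad_fn, x, n_samples=50, sigma=0.1, seed=0):
    """Average grad_fn over n_samples copies of x perturbed by Gaussian noise.

    grad_fn maps an input array to an explanation of the same shape;
    sigma is the standard deviation of the added noise.
    """
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples

# Usage: smooth the gradient of the toy model at a random input.
rng = np.random.default_rng(1)
w = rng.normal(size=16)
x = rng.normal(size=16)
smoothed = smoothgrad(lambda z: toy_gradient(z, w), x)
```

Because `grad_fn` is passed in as a callable, the same wrapper applies to any base explanation method that maps an input to a map of the same shape, which mirrors how SmoothGrad is applied to V, IxG and IG here.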

As shown in Fig. 6, for the Distribution Learnability (DL), the SmoothGrad-applied versions all outperform the original baselines (the differences $\Delta{DL}$ are $0.130$ , $0.041$ and $0.031$ for V, IxG and IG, respectively), which implies that explanations with SmoothGrad are more learnable and generalizable. In addition, as Fig. 7 illustrates, SmoothGrad significantly strengthens the Variance Proximity (VP) of the explanations: the VPs of the three explainability methods increase by $0.295$ , $0.468$ and $0.475$ , respectively. As a consequence, SmoothGrad raises the final scores of V, IxG, and IG from $0.241$ , $0.196$ , and $0.200$ to $0.514$ , $0.415$ , and $0.426$ , respectively. This is consistent with human intuition: [13] concludes that SmoothGrad significantly reduces the noise in the saliency maps and thus improves their comprehensibility. We present the detailed quantitative results in Table S3 and the two learned explanation distributions in Fig. S5.

5 Conclusion

This work provides a novel perspective for quantitatively evaluating explainability methods: generalizability. We argue that the distributions of good explanations should exhibit clearer regularity and learnability, and should reflect both the similarities within classes and the variations between classes. We demonstrate the evaluation of multiple explainability methods with the proposed approach. In future work, we look forward to refining quantitative evaluation approaches for explainability methods so that they become more comprehensive and objective.

6 Supplementary Material

In this section, we provide additional materials for the main part of the paper.

6.1 Architecture & Performance of the Autoencoder

We reconstruct the explanations with a simply structured Autoencoder, which is illustrated in Fig. S1.

The qualitative performance of the Autoencoder when reconstructing the original MNIST dataset is illustrated in Fig. S2. Notably, the Autoencoder with this architecture is capable of reconstructing the original images with high fidelity, which indicates that its learning capacity is sufficient for the dataset.

6.2 Supplementary results for Section 4.1

In this subsection, we present the detailed results (all metrics are included) for the Case Study in Table S1, i.e., the explanation learnability for LIME with different $n\_sample$ . In Fig. S3, we visualize three reconstructions of the explanations. Note that the quality or contrast of the reconstructed explanations does not directly indicate the generalizability of the corresponding explanations.

6.3 Supplementary results for Section 4.2

In this subsection, we first provide detailed tabular results for the evaluation experiments of popular explainability methods in Table S2. Subsequently, we display in Fig. S4 three examples of reconstructed explanations by learning the corresponding explainability methods.

6.4 Supplementary results for Section 4.3

Again, the complementary results of the experiments on SmoothGrad are presented in Table S3. We show two groups of visualization samples in Fig. S5, including Vanilla Gradients, Input $\times$ Gradients, Integrated Gradients and their corresponding SmoothGrad versions.

Bibliography

[1] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, “Explainable ai: A review of machine learning interpretability methods,” Entropy, vol. 23, no. 1, p. 18, 2020.
[2] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, vol. 1, no. 5, pp. 206–215, 2019.
[3] M. Narayanan, E. Chen, J. He, B. Kim, S. Gershman, and F. Doshi-Velez, “How do humans understand explanations from machine learning systems? an evaluation of the human-interpretability of explanation,” arXiv preprint arXiv:1802.00682, 2018.
[4] A. Alqaraawi, M. Schuessler, P. Weiß, E. Costanza, and N. Berthouze, “Evaluating saliency map explanations for convolutional neural networks: a user study,” in Proceedings of the 25th International Conference on Intelligent User Interfaces, 2020, pp. 275–285.
[5] V. Petsiuk, A. Das, and K. Saenko, “Rise: Randomized input sampling for explanation of black-box models,” arXiv preprint arXiv:1806.07421, 2018.
[6] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PLoS ONE, vol. 10, no. 7, p. e0130140, 2015.
[7] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim, “Sanity checks for saliency maps,” Advances in neural information processing systems, vol. 31, 2018.
[8] S. Hooker, D. Erhan, P.-J. Kindermans, and B. Kim, “A benchmark for interpretability methods in deep neural networks,” Advances in neural information processing systems, vol. 32, 2019.