Benchmarking Attribution Methods with Relative Feature Importance
Mengjiao Yang, Been Kim

TL;DR
This paper introduces BAM, a framework for quantitatively evaluating feature attribution methods using datasets with known feature importance, revealing that some methods produce false positive explanations.
Contribution
The paper presents a novel benchmarking framework with datasets, models, and metrics for evaluating attribution methods against known feature importance.
Findings
Certain attribution methods are prone to false positives
The framework enables quantitative comparison of attribution methods
Open source resources facilitate future research
Abstract
Interpretability is an important area of research for safe deployment of machine learning systems. One particular type of interpretability method attributes model decisions to input features. Despite active development, quantitative evaluation of feature attribution methods remains difficult due to the lack of ground truth: we do not know which input features are in fact important to a model. In this work, we propose a framework for Benchmarking Attribution Methods (BAM) with a priori knowledge of relative feature importance. BAM includes 1) a carefully crafted dataset and models trained with known relative feature importance and 2) three complementary metrics to quantitatively evaluate attribution methods by comparing feature attributions between pairs of models and pairs of inputs. Our evaluation on several widely-used attribution methods suggests that certain methods are more likely…
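The abstract does not define the three metrics. The sketch below only illustrates the general idea of scoring attributions "between pairs of models and pairs of inputs" when the important region is known in advance. It rests on several assumptions that are not taken from the paper. Attribution maps are small 2D grids. The known region is a boolean mask. "Concentration" means the share of absolute attribution that falls inside the mask. All function names (`concentration`, `model_contrast`, `input_dependence_rate`) are hypothetical. Treat it as a simplified illustration, not as BAM's exact metric definitions.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]   # hypothetical: one attribution value per input cell
Mask = List[List[bool]]    # hypothetical: True where the known-important feature sits


def concentration(attr: Grid, mask: Mask) -> float:
    """Share of total absolute attribution that lands inside the masked region."""
    inside = 0.0
    total = 0.0
    for attr_row, mask_row in zip(attr, mask):
        for value, in_region in zip(attr_row, mask_row):
            total += abs(value)
            if in_region:
                inside += abs(value)
    return inside / total if total > 0 else 0.0


def model_contrast(attr_model_a: Grid, attr_model_b: Grid, mask: Mask) -> float:
    """Pair of models, same input (simplified assumption).

    Model A is assumed to rely on the masked feature and model B is assumed
    not to, so a faithful attribution method should give a positive score.
    """
    return concentration(attr_model_a, mask) - concentration(attr_model_b, mask)


def input_dependence_rate(pairs: Sequence[Tuple[Grid, Grid]], mask: Mask) -> float:
    """Pair of inputs, same model (simplified assumption).

    Each pair is (attribution with an unimportant feature pasted into the
    region, attribution without it). The score is the fraction of pairs in
    which the region receives less attribution when that feature is present.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1
        for with_feature, without_feature in pairs
        if concentration(with_feature, mask) < concentration(without_feature, mask)
    )
    return hits / len(pairs)


if __name__ == "__main__":
    region = [[True, False], [False, False]]
    attr_relies_on_region = [[0.8, 0.1], [0.05, 0.05]]
    attr_ignores_region = [[0.2, 0.3], [0.3, 0.2]]
    print(round(model_contrast(attr_relies_on_region, attr_ignores_region, region), 3))
```

Comparing the same input across two models isolates what the models learned, while comparing two inputs through one model isolates what changed in the input. Both rely on knowing beforehand which region ought to matter.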
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification
Methods: Interpretability
