The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations
Peter Hase, Harry Xie, Mohit Bansal

TL;DR
This paper investigates the out-of-distribution (OOD) problem in feature importance (FI) explanations, proposes a training modification that improves explanation alignment, compares feature removal methods, and introduces a new search algorithm that outperforms existing baselines.
Contribution
The paper highlights the OOD problem in FI explanations, proposes a training adjustment that improves the social alignment of explanations, and introduces a novel search-based explanation method that surpasses existing approaches.
Findings
Model training adjustments improve explanation alignment.
Some feature removal methods produce more OOD counterfactuals than others.
Parallel Local Search outperforms other explanation search methods.
Abstract
Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time. For example, in the standard Sufficiency metric, only the top-k most important tokens are kept. In this paper, we study several under-explored dimensions of FI explanations, providing conceptual and empirical improvements for this form of explanation. First, we advance a new argument for why it can be problematic to remove features from an input when creating or evaluating explanations: the fact that these counterfactual inputs are out-of-distribution (OOD) to models implies that the resulting explanations are socially misaligned. The crux of the problem is that the model prior and random weight initialization influence the explanations (and explanation metrics) in unintended…
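The Sufficiency metric described in the abstract can be illustrated with a minimal sketch: keep only the top-k tokens by importance and measure how much model confidence changes. Everything named below is an assumption made for illustration and is not taken from the paper: the `[MASK]` placeholder used to remove tokens, the helper names `keep_top_k` and `sufficiency`, and the toy `toy_confidence` function standing in for a real classifier. Replacing removed tokens with a mask is only one of several possible feature removal choices.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # hypothetical placeholder for a removed token


def keep_top_k(tokens: Sequence[str], importances: Sequence[float], k: int,
               mask: str = MASK) -> List[str]:
    """Keep the k highest-importance tokens and replace the rest with `mask`."""
    order = sorted(range(len(tokens)), key=lambda i: importances[i], reverse=True)
    kept = set(order[:k])
    return [tok if i in kept else mask for i, tok in enumerate(tokens)]


def sufficiency(confidence: Callable[[Sequence[str]], float],
                tokens: Sequence[str], importances: Sequence[float],
                k: int) -> float:
    """Confidence on the full input minus confidence on the top-k-only input.

    A small value means the kept tokens alone nearly reproduce the
    model's original confidence.
    """
    reduced = keep_top_k(tokens, importances, k)
    return confidence(tokens) - confidence(reduced)


# Toy stand-in for a classifier (assumption): "confidence" in the positive
# class is the fraction of tokens that appear in a small positive lexicon.
POSITIVE = {"great", "good"}


def toy_confidence(tokens: Sequence[str]) -> float:
    return sum(tok in POSITIVE for tok in tokens) / len(tokens)


tokens = ["a", "great", "movie"]
good_explanation = [0.1, 0.9, 0.2]  # ranks "great" highest
poor_explanation = [0.9, 0.1, 0.2]  # ranks "a" highest

good_score = sufficiency(toy_confidence, tokens, good_explanation, k=1)
poor_score = sufficiency(toy_confidence, tokens, poor_explanation, k=1)
```

In this toy setting, the explanation that ranks "great" highest yields a smaller confidence drop than the one that ranks "a" highest. With a real model, the masked inputs passed to `confidence` are exactly the counterfactual inputs that the abstract argues may be out-of-distribution.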
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning and Data Classification
Methods: Counterfactual Explanations · Random Search · Local Interpretable Model-Agnostic Explanations
