Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability
Usha Bhalla, Suraj Srinivas, Himabindu Lakkaraju

TL;DR
This paper introduces Distractor Erasure Tuning (DiET), a method that enhances black-box models' robustness to distractor features, leading to more faithful and discriminative feature attributions that bridge post hoc explainability and inherent interpretability.
Contribution
The paper proposes DiET, a novel training strategy that improves the faithfulness of feature attributions by making black-box models robust to distractor erasure, combining benefits of both explanation approaches.
Findings
DiET produces models that closely match the behavior of the original black-box model.
DiET explanations align with ground truth attributions.
Enhanced robustness improves explanation faithfulness.
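The robustness property behind these findings can be illustrated with a toy check. The sketch below is not the DiET training procedure, which this summary does not detail. It only shows what it could mean for a model to be robust to distractor erasure: replacing the features outside an attribution mask with noise should leave the output unchanged. The linear models, the mask, the Gaussian noise, and all function names are hypothetical choices made for illustration.

```python
import random

def erase_distractors(x, mask, rng):
    """Keep features where mask is 1; replace the rest with random noise."""
    return [xi if m else rng.gauss(0.0, 1.0) for xi, m in zip(x, mask)]

def linear_model(weights):
    """Toy stand-in for a black-box model: a fixed linear scorer."""
    def predict(x):
        return sum(w * xi for w, xi in zip(weights, x))
    return predict

def erasure_gap(predict, x, mask, rng, trials=100):
    """Mean absolute output change when features outside the mask are erased."""
    base = predict(x)
    total = 0.0
    for _ in range(trials):
        total += abs(predict(erase_distractors(x, mask, rng)) - base)
    return total / trials

rng = random.Random(0)
x = [1.0, -2.0, 0.5, 3.0]
mask = [1, 1, 0, 0]  # hypothetical attribution: the first two features are the signal

robust = linear_model([2.0, -1.0, 0.0, 0.0])     # ignores the masked-out features
sensitive = linear_model([2.0, -1.0, 0.7, 0.4])  # depends on the masked-out features

robust_gap = erasure_gap(robust, x, mask, rng)
sensitive_gap = erasure_gap(sensitive, x, mask, rng)
print(robust_gap, sensitive_gap)
```

In this toy setting, the mask is a faithful attribution for the first model, whose output does not move under erasure, and an unfaithful one for the second, whose output depends on features the mask labels as distractors.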
Abstract
With the increased deployment of machine learning models in various real-world applications, researchers and practitioners alike have emphasized the need for explanations of model behaviour. To this end, two broad strategies have been outlined in prior literature to explain models. Post hoc explanation methods explain the behaviour of complex black-box models by identifying features critical to model predictions; however, prior work has shown that these explanations may not be faithful, in that they incorrectly attribute high importance to features that are unimportant or non-discriminative for the underlying task. Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture, meaning their explanations are naturally faithful, but they often exhibit poor predictive performance due to their limited expressive…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
Methods: High-Order Consensuses
