Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability

Usha Bhalla; Suraj Srinivas; Himabindu Lakkaraju

arXiv:2307.15007 · cs.LG · February 19, 2024 · 2 cites

TL;DR

This paper introduces Distractor Erasure Tuning (DiET), a method that enhances black-box models' robustness to distractor features, leading to more faithful and discriminative feature attributions that bridge post hoc explainability and inherent interpretability.

Contribution

The paper proposes DiET, a novel training strategy that improves the faithfulness of feature attributions by making black-box models robust to distractor erasure, combining benefits of both explanation approaches.
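As a rough illustration of what "robust to distractor erasure" means, the sketch below erases every feature outside a mask and measures how far a model's prediction moves. It is a toy check of the property only, not the DiET training procedure. The helper names (`erase_distractors`, `erasure_gap`), the zero baseline, and the linear toy model are all assumptions made for this example and do not come from the paper or its repository.

```python
import numpy as np

def erase_distractors(x, mask, baseline):
    """Keep features where mask == 1; replace the rest with baseline values."""
    return mask * x + (1.0 - mask) * baseline

def erasure_gap(predict, x, mask, baseline):
    """L1 distance between predictions on the original and the erased input."""
    erased = erase_distractors(x, mask, baseline)
    return float(np.abs(predict(x) - predict(erased)).sum())

# Hypothetical toy model: a linear scorer that only uses the first two features.
weights = np.array([2.0, -1.0, 0.0, 0.0])

def predict(x):
    return np.array([x @ weights])

x = np.array([1.0, 3.0, 5.0, 7.0])
baseline = np.zeros_like(x)  # assumed erasure value; other baselines are possible

signal_mask = np.array([1.0, 1.0, 0.0, 0.0])      # keeps the features the model uses
distractor_mask = np.array([0.0, 0.0, 1.0, 1.0])  # keeps only the ignored features

gap_signal = erasure_gap(predict, x, signal_mask, baseline)
gap_distractor = erasure_gap(predict, x, distractor_mask, baseline)
print(gap_signal, gap_distractor)
```

Under these assumptions, a mask that covers the features the model actually relies on leaves the prediction unchanged, while a mask that keeps only distractors shifts it. A small gap for a sparse mask is the kind of behaviour a faithful, discriminative attribution would be expected to show.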

Findings

01 DiET produces models closely matching original black-box behavior.

02 DiET explanations align with ground truth attributions.

03 Enhanced robustness improves explanation faithfulness.

Abstract

With the increased deployment of machine learning models in various real-world applications, researchers and practitioners alike have emphasized the need for explanations of model behaviour. To this end, two broad strategies have been outlined in prior literature to explain models. Post hoc explanation methods explain the behaviour of complex black-box models by identifying features critical to model predictions; however, prior work has shown that these explanations may not be faithful, in that they incorrectly attribute high importance to features that are unimportant or non-discriminative for the underlying task. Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture, meaning their explanations are naturally faithful, but they often exhibit poor predictive performance due to their limited expressive…

Code & Models

Repositories

ai4life-group/diet · PyTorch · Official

Videos

Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning

Methods: High-Order Consensuses