AttributionLab: Faithfulness of Feature Attribution Under Controllable Environments
Yang Zhang, Yawei Li, Hannah Brown, Mina Rezaei, Bernd Bischl, Philip Torr, Ashkan Khakzar, Kenji Kawaguchi

TL;DR
AttributionLab provides a controlled environment to rigorously test the faithfulness of feature attribution methods in neural networks, ensuring their reliability before application in real-world scenarios.
Contribution
We designed a novel setup, AttributionLab, with manually set network weights and controlled data to serve as a sanity check for attribution method faithfulness.
Findings
AttributionLab can identify unfaithful attribution methods.
The controlled environment reveals limitations of existing attribution techniques.
The framework enables systematic analysis and improvement of attribution methods.
Abstract
Feature attribution explains neural network outputs by identifying relevant input features. The attribution has to be faithful, meaning that the attributed features must mirror the input features that influence the output. One recent trend to test faithfulness is to fit a model on designed data with known relevant features and then compare attributions with the ground-truth input features. This idea assumes that the model learns to use all and only these designed features, for which there is no guarantee. In this paper, we solve this issue by designing the network and manually setting its weights, along with designing the data. The setup, AttributionLab, serves as a sanity check for faithfulness: if an attribution method is not faithful in a controlled environment, it can be unreliable in the wild. The environment is also a laboratory for controlled experiments by which we can analyze…
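The idea of a hand-weighted network with known relevant features can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not the paper's actual network, data, or evaluation: the weights `W1` and `W2`, the four-feature input, and the helper names are hypothetical, and occlusion is used only as a simple stand-in attribution method. Because the weights are set by hand so that features 2 and 3 never reach the output, the ground-truth set of relevant features is known by construction, and an attribution can be checked against it.

```python
def relu(v):
    return max(0.0, v)

# Hypothetical hand-set weights: both hidden units read only features 0 and 1,
# so features 2 and 3 cannot influence the output by construction.
W1 = [
    [1.0, 1.0, 0.0, 0.0],   # h0 = relu(x0 + x1)
    [1.0, -1.0, 0.0, 0.0],  # h1 = relu(x0 - x1)
]
W2 = [1.0, 0.5]             # y = h0 + 0.5 * h1
GROUND_TRUTH = {0, 1}       # indices of the features the network actually uses


def forward(x):
    """Run the two-layer network with the hand-set weights."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * h for w, h in zip(W2, hidden))


def occlusion_attribution(x, baseline=0.0):
    """Score each feature by the output change when it is set to the baseline."""
    full = forward(x)
    scores = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline
        scores.append(full - forward(occluded))
    return scores


def attributed_features(scores, tol=1e-9):
    """Indices whose attribution score is non-zero (up to a tolerance)."""
    return {i for i, s in enumerate(scores) if abs(s) > tol}


x = [2.0, 1.0, 5.0, -3.0]
scores = occlusion_attribution(x)
is_faithful_here = attributed_features(scores) == GROUND_TRUTH
```

In this toy case the check passes for the chosen input. A method that assigned non-zero scores to features 2 or 3, or zero scores to features 0 or 1, would be flagged as unfaithful on this input without any uncertainty about what the model "really" uses.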
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Neural Networks and Applications
