Model Guidance via Robust Feature Attribution
Mihnea Ghitu, Vihari Piratla, Matthew Wicker

TL;DR
This paper introduces a method that guides models toward relevant features by optimizing explanation robustness, reducing reliance on shortcut features in domains including NLP and medical imaging.
Contribution
We propose a simplified objective that improves explanation robustness and mitigates shortcut learning, with theoretical justification and extensive empirical validation across multiple tasks.
Findings
Reduces test-time misclassifications by 20% relative to state-of-the-art methods.
Effective across diverse domains including NLP and medical imaging.
Highlights the importance of annotation quality over quantity.
Abstract
Controlling the patterns a model learns is essential to preventing reliance on irrelevant or misleading features. Reliance on such features, often called shortcut features, has been observed across domains, including medical imaging and natural language processing, where it may lead to real-world harms. A common mitigation strategy leverages annotations (provided by humans or machines) indicating which features are relevant or irrelevant. These annotations are compared to model explanations, typically in the form of feature salience, and used to guide the loss function during training. Unfortunately, recent work has demonstrated that feature salience methods are unreliable and therefore offer a poor signal to optimize. In this work, we propose a simplified objective that simultaneously optimizes for explanation robustness and mitigation of shortcut learning. Unlike prior…
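The common mitigation strategy described in the abstract (comparing annotations against feature salience and folding the result into the training loss) can be illustrated with a minimal sketch. This is not the paper's proposed objective; it only shows the generic salience-penalty idea for a logistic-regression model, where the input gradient of the logit serves as the salience. The names `guided_loss`, `irrelevant_mask`, and `lam` are hypothetical and chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def guided_loss(w, b, X, y, irrelevant_mask, lam=1.0):
    """Cross-entropy plus a penalty on salience over features annotated as irrelevant.

    Assumption: salience is the gradient of the logit with respect to the input.
    For a linear model that gradient equals w for every example.
    irrelevant_mask has the same shape as X, with 1 marking an irrelevant feature.
    """
    p = sigmoid(X @ w + b)
    eps = 1e-12  # avoids log(0)
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    # Input-gradient salience, repeated for each example in the batch.
    salience = np.broadcast_to(w, X.shape)
    # Squared salience mass that falls on annotated-irrelevant features.
    penalty = np.mean(np.sum((irrelevant_mask * salience) ** 2, axis=1))
    return ce + lam * penalty, ce, penalty

# Toy data: the second feature is annotated as irrelevant for every example.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
w = np.array([1.0, 2.0])
mask = np.array([[0.0, 1.0], [0.0, 1.0]])

total, ce, penalty = guided_loss(w, 0.0, X, y, mask, lam=1.0)
_, _, no_penalty = guided_loss(w, 0.0, X, y, np.zeros_like(mask), lam=1.0)
```

Minimizing the combined loss pushes the weight on the masked feature toward zero, which is the sense in which annotations "guide" training. The abstract's concern is that salience signals of this kind can be unreliable to optimize against.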
Taxonomy
Topics: Adversarial Robustness in Machine Learning
