Explainability-Guided Defense: Attribution-Aware Model Refinement Against Adversarial Data Attacks
Longwei Wang, Mohammad Navid Nayyem, Abdullah Al Rakin, KC Santosh, Chaowei Zhang, Yang Zhou

TL;DR
This paper introduces an attribution-guided training framework that uses model explanations to improve the robustness of deep learning models against adversarial attacks, without requiring extra data or architecture changes.
Contribution
The paper turns LIME explanations from a passive diagnostic into an active training signal, using them to suppress spurious features with the aim of improving adversarial robustness and generalization.
Findings
Significant robustness improvements on CIFAR datasets
Effective suppression of spurious, irrelevant features
Theoretical link between explanation alignment and robustness
Abstract
The growing reliance on deep learning models in safety-critical domains such as healthcare and autonomous navigation underscores the need for defenses that are both robust to adversarial perturbations and transparent in their decision-making. In this paper, we identify a connection between interpretability and robustness that can be directly leveraged during training. Specifically, we observe that spurious, unstable, or semantically irrelevant features identified through Local Interpretable Model-Agnostic Explanations (LIME) contribute disproportionately to adversarial vulnerability. Building on this insight, we introduce an attribution-guided refinement framework that transforms LIME from a passive diagnostic into an active training signal. Our method systematically suppresses spurious features using feature masking, sensitivity-aware regularization, and adversarial augmentation in a…
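The idea of using a LIME-style attribution as a masking signal can be illustrated with a minimal sketch. This is not the authors' implementation: the patch grid, the proximity kernel, the quantile threshold, the toy `predict` function, and the helper names `lime_patch_attribution` and `mask_low_attribution` are all assumptions made for illustration, and the sketch uses plain NumPy rather than the `lime` package.

```python
import numpy as np

def lime_patch_attribution(predict, image, patch=4, n_samples=200, seed=0):
    """LIME-style attribution: fit a weighted linear surrogate on random patch on/off masks."""
    rng = np.random.default_rng(seed)
    h, w = image.shape  # assumes h and w are multiples of `patch`
    gh, gw = h // patch, w // patch
    # Binary design matrix: 1 keeps a patch, 0 zeroes it out.
    z = rng.integers(0, 2, size=(n_samples, gh * gw))
    z[0] = 1  # include the unperturbed image
    y = np.empty(n_samples)
    for i in range(n_samples):
        mask = np.kron(z[i].reshape(gh, gw), np.ones((patch, patch)))
        y[i] = predict(image * mask)
    # Proximity kernel (hypothetical choice): samples closer to the original weigh more.
    dist = 1.0 - z.mean(axis=1)
    sw = np.sqrt(np.exp(-(dist ** 2) / 0.25))
    # Weighted least squares with an intercept column.
    X = np.hstack([z, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[:-1].reshape(gh, gw)

def mask_low_attribution(image, attribution, patch=4, keep_fraction=0.25):
    """Zero out patches whose absolute attribution falls below a quantile threshold."""
    threshold = np.quantile(np.abs(attribution), 1.0 - keep_fraction)
    keep = (np.abs(attribution) >= threshold).astype(image.dtype)
    return image * np.kron(keep, np.ones((patch, patch), dtype=image.dtype))

# Toy model (assumption): the score depends only on the top-left 4x4 patch,
# so every other patch is irrelevant to the prediction.
def predict(x):
    return float(x[:4, :4].sum())

image = np.random.default_rng(1).random((8, 8))
attr = lime_patch_attribution(predict, image)
masked = mask_low_attribution(image, attr)
```

In this toy setting the surrogate assigns essentially all attribution to the top-left patch, so masking removes the other three patches without changing the model's score. In a training loop, masked inputs of this kind could serve as one augmentation alongside the regularization and adversarial examples the abstract mentions; how the paper actually combines them is not specified in the text above.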
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
