Explainability-Guided Defense: Attribution-Aware Model Refinement Against Adversarial Data Attacks

Longwei Wang; Mohammad Navid Nayyem; Abdullah Al Rakin; KC Santosh; Chaowei Zhang; Yang Zhou

arXiv:2601.00968 · cs.LG · January 6, 2026

TL;DR

This paper introduces an attribution-guided training framework that leverages interpretability explanations to improve the robustness of deep learning models against adversarial attacks, without needing extra data or architecture changes.

Contribution

The paper presents a method that uses explanation techniques such as LIME as an active training signal, rather than only as a post-hoc diagnostic, to suppress spurious features, with the aim of improving adversarial robustness and generalization.

Findings

01. Significant robustness improvements on CIFAR datasets

02. Effective suppression of spurious, irrelevant features

03. Theoretical link between explanation alignment and robustness

Abstract

The growing reliance on deep learning models in safety-critical domains such as healthcare and autonomous navigation underscores the need for defenses that are both robust to adversarial perturbations and transparent in their decision-making. In this paper, we identify a connection between interpretability and robustness that can be directly leveraged during training. Specifically, we observe that spurious, unstable, or semantically irrelevant features identified through Local Interpretable Model-Agnostic Explanations (LIME) contribute disproportionately to adversarial vulnerability. Building on this insight, we introduce an attribution-guided refinement framework that transforms LIME from a passive diagnostic into an active training signal. Our method systematically suppresses spurious features using feature masking, sensitivity-aware regularization, and adversarial augmentation in a…
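To make the idea of using LIME as a training signal more concrete, the following is a minimal sketch of one ingredient mentioned in the abstract, feature masking driven by a LIME-style attribution. It is not the authors' implementation. It assumes a from-scratch local linear surrogate in NumPy rather than the `lime` package, a toy linear model in place of a deep network, and a hypothetical flagging rule (`flag_spurious`) that treats a segment as spurious when its attribution is large but it lies outside a user-supplied set of relevant segments. All function names and parameters are illustrative; sensitivity-aware regularization and adversarial augmentation are not shown.

```python
import numpy as np


def lime_style_attribution(predict, x, n_segments, n_samples=200, ridge=1e-3, seed=0):
    """Fit a weighted linear surrogate over on/off segment masks (LIME-like).

    Hypothetical helper: `predict` maps a 1-D array to a scalar score.
    Returns one coefficient per segment and the index arrays of the segments.
    """
    rng = np.random.default_rng(seed)
    segments = np.array_split(np.arange(x.size), n_segments)

    # Binary masks: 1 keeps a segment, 0 replaces it with a zero baseline.
    Z = rng.integers(0, 2, size=(n_samples, n_segments)).astype(float)
    Z[0] = 1.0  # always include the unperturbed input

    y = np.empty(n_samples)
    for i, z in enumerate(Z):
        perturbed = x.copy()
        for j, idx in enumerate(segments):
            if z[j] == 0:
                perturbed[idx] = 0.0
        y[i] = predict(perturbed)

    # Proximity kernel: samples that keep more segments get more weight.
    dist = 1.0 - Z.mean(axis=1)
    weights = np.exp(-(dist ** 2) / 0.25)

    # Weighted ridge regression with an unpenalized intercept column.
    A = np.hstack([Z, np.ones((n_samples, 1))])
    Aw = A * weights[:, None]
    reg = ridge * np.eye(n_segments + 1)
    reg[-1, -1] = 0.0
    coef = np.linalg.solve(A.T @ Aw + reg, Aw.T @ y)
    return coef[:-1], segments


def flag_spurious(attr, relevant, frac=0.5):
    """Assumed rule: flag high-attribution segments outside the relevant set."""
    cutoff = frac * np.abs(attr).max()
    return [j for j, a in enumerate(attr) if abs(a) >= cutoff and j not in relevant]


def mask_segments(x, segments, flagged):
    """Return a copy of x with the flagged segments zeroed out."""
    masked = x.copy()
    for j in flagged:
        masked[segments[j]] = 0.0
    return masked


# Toy setup: segment 0 is the "true" signal, segment 2 is a strong shortcut.
w_toy = np.array([1.0, 1.0, 0.0, 0.0, 3.0, 3.0, 0.0, 0.0])
toy_predict = lambda v: float(w_toy @ v)
x_toy = np.ones(8)

attr_toy, segs_toy = lime_style_attribution(toy_predict, x_toy, n_segments=4)
flagged_toy = flag_spurious(attr_toy, relevant={0})
x_masked = mask_segments(x_toy, segs_toy, flagged_toy)
print("attributions:", np.round(attr_toy, 2))
print("flagged segments:", flagged_toy)
```

In a training loop, inputs masked this way could be fed back to the model so that it cannot rely on the flagged segments. How the paper actually selects and suppresses features may differ from this sketch.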

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis