Where Classification Fails, Interpretation Rises
Chanh Nguyen, Georgi Georgiev, Yujie Ji, Ting Wang

TL;DR
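This paper introduces an adversarial input detection framework that contrasts a model's interpretation of an input with its classification. It exploits the defining property of adversarial inputs, that they deceive the network while remaining barely discernible to humans, to make detection more robust against adaptive attacks.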
Contribution
It proposes an interpretability-based detection approach that contrasts an input's interpretation with its classification, departing from methods that rely on engineered patterns.
Findings
Effective detection across multiple benchmark datasets
Robust against adaptive adversarial attacks
Opens new directions in adversarial input detection
Abstract
An intriguing property of deep neural networks is their inherent vulnerability to adversarial inputs, which significantly hinders their application in security-critical domains. Most existing detection methods attempt to use carefully engineered patterns to distinguish adversarial inputs from their genuine counterparts; however, these patterns can often be circumvented by adaptive adversaries. In this work, we take a completely different route by leveraging the definition of adversarial inputs: while they deceive deep neural networks, they are barely discernible to human vision. Building upon recent advances in interpretable models, we construct a new detection framework that contrasts an input's interpretation against its classification. We validate the efficacy of this framework through extensive experiments using benchmark datasets and attacks. We believe that this work opens a new…
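To make the idea of contrasting an interpretation with a classification concrete, here is a minimal sketch on a toy linear softmax classifier. It is not the authors' framework: the gradient-times-input attribution, the per-class reference maps (`templates`), the cosine-similarity check, and the 0.9 threshold are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def classify(W, x):
    """Predicted class of a linear softmax classifier with weights W."""
    return int(np.argmax(W @ x))

def interpret(W, x, label):
    """Gradient-times-input attribution for `label` (an assumed interpreter)."""
    p = softmax(W @ x)
    grad = W[label] - p @ W  # d log p[label] / dx for a linear softmax model
    return grad * x

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def looks_adversarial(W, x, templates, threshold=0.9):
    """Flag x when its attribution map disagrees with the reference map
    of the class the classifier assigned to it."""
    label = classify(W, x)
    return cosine(interpret(W, x, label), templates[label]) < threshold

# Toy setup (hypothetical data): two classes, four features.
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
clean0 = np.array([2.0, 0.0, 0.0, 0.0])  # typical class-0 input
clean1 = np.array([0.0, 2.0, 0.0, 0.0])  # typical class-1 input
templates = np.stack([interpret(W, clean0, 0), interpret(W, clean1, 1)])

# A class-0 input pushed across the decision boundary into class 1.
adv = clean0 + np.array([0.0, 2.5, 0.0, 0.0])

print(classify(W, adv), looks_adversarial(W, clean0, templates),
      looks_adversarial(W, adv, templates))
```

In this toy case the perturbed input is classified as class 1, but its attribution map still carries the class-0 feature, so it no longer matches the class-1 reference map and is flagged. The perturbation here is deliberately large so the effect is easy to see; a practical detector would need a learned interpreter and a calibrated threshold.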
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning
