Learning to Disentangle Robust and Vulnerable Features for Adversarial Detection
Byunggill Joe, Sung Ju Hwang, Insik Shin

TL;DR
This paper introduces a novel method to disentangle robust and vulnerable features in neural networks using variational autoencoders, improving adversarial detection against both blackbox and whitebox attacks.
Contribution
The paper proposes a minimax game framework that separates robust features from vulnerable ones, improving both the detection of adversarial inputs and the understanding of why they succeed.
Findings
Effective detection of adversarial inputs on multiple datasets
Robust features resist adversarial perturbations
Vulnerable features are key to understanding adversarial success
Abstract
Although deep neural networks have shown promising performance on various tasks, even achieving human-level performance on some, they are susceptible to incorrect predictions even under imperceptibly small perturbations to an input. A large body of previous work has proposed to defend against such adversarial attacks, either by robust inference or by detection of adversarial inputs. Yet most of these defenses cannot effectively withstand whitebox attacks, where an adversary has knowledge of both the model and the defense. More importantly, they do not provide a convincing reason why the generated adversarial inputs successfully fool the target models. To address these shortcomings of existing approaches, we hypothesize that adversarial inputs are tied to latent features that are susceptible to adversarial perturbation, which we call vulnerable features. Then based…
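The abstract's opening claim, that a small perturbation can flip a prediction, can be illustrated on a toy linear classifier. The sketch below is not the paper's method: it uses a sign-of-gradient step (in the style of the fast gradient sign method), and the weights, input, and helper names (`predict`, `sign_step_attack`) are assumptions chosen purely for illustration.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def predict(w, b, x):
    """Return class 1 if the linear score w.x + b is positive, else class 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0

def sign_step_attack(w, x, label, eps):
    """Move every coordinate by eps in the direction that pushes the
    score away from `label`. For a linear score the gradient with
    respect to x is simply w, so the step is eps * sign(w)."""
    direction = -1 if label == 1 else 1
    return [xi + direction * eps * sign(wi) for wi, xi in zip(w, x)]

# Hypothetical toy model and input (illustrative values only).
w = [0.5, -0.25, 0.75, -0.5]
b = 0.0
x = [0.25, 0.0, 0.0, 0.0]
eps = 0.125  # maximum change allowed per coordinate

clean_label = predict(w, b, x)
x_adv = sign_step_attack(w, x, clean_label, eps)
adv_label = predict(w, b, x_adv)
max_change = max(abs(a - c) for a, c in zip(x_adv, x))

print(clean_label, adv_label, max_change)
```

Each coordinate moves by at most `eps`, yet the score shifts by `eps` times the sum of the absolute weights, which is enough here to cross the decision boundary. This accumulation across dimensions is one common explanation for why small perturbations can be effective; the paper's own explanation in terms of vulnerable features is separate from this toy example.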
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Forensic and Genetic Research
