Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability
Akshay G Rao, Chandrashekhar Lakshminarayanan, Arun Rajkumar

TL;DR
This paper explores how Deep Linearly Gated Networks (DLGN) can interpret adversarially trained models, revealing differences in feature representations and gating patterns between robust and standard models to better understand adversarial defenses.
Contribution
The paper uses the DLGN architecture to interpret adversarially trained models and compares their internal properties with those of standard models.
Findings
Under PGD adversarial training (PGD-AT), hyperplanes lie farther from the data points.
PGD-AT models develop diverse, non-overlapping subnetworks across classes.
Visualizations reveal distinct representation patterns in robust models.
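To make the first finding concrete, here is a minimal sketch of how the distance between data points and a set of hyperplanes could be measured. It assumes each hyperplane is given by a weight vector and bias, w_j . x + b_j = 0, which is a natural reading when the feature network is linear in the input. The function name and the toy numbers are illustrative only and are not taken from the paper.

```python
import numpy as np

def hyperplane_distances(X, W, b):
    """Distance from each row of X to each hyperplane w_j . x + b_j = 0.

    X: (n_points, d) data, W: (n_planes, d) normals, b: (n_planes,) biases.
    Returns an (n_points, n_planes) array of Euclidean distances.
    """
    norms = np.linalg.norm(W, axis=1)      # length of each normal vector
    return np.abs(X @ W.T + b) / norms     # |w . x + b| / ||w||

# Toy data (hypothetical): two points, two hyperplanes in 2-D.
X = np.array([[3.0, 4.0], [0.0, 0.0]])
W = np.array([[1.0, 0.0], [3.0, 4.0]])
b = np.array([-1.0, 5.0])
D = hyperplane_distances(X, W, b)
print(D)
```

Comparing a summary statistic of such distances (for example, the per-point minimum) between a standard model and a PGD-AT model is one way the claim above could be checked.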
Abstract
Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. Adversarial training has been a popular defence technique against these adversarial attacks. In this work, we capitalize on a network architecture, namely Deep Linearly Gated Networks (DLGN), which has better interpretation capabilities than regular deep network architectures. Using this architecture, we interpret robust models trained using PGD adversarial training and compare them with standard training. Feature networks in DLGN act as feature extractors, making them the only medium through which an adversary can attack the model. We analyze the feature network of DLGN with fully connected layers with respect to properties like alignment of the hyperplanes, hyperplane relation with PCA, and sub-network overlap among classes and compare these properties…
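For readers unfamiliar with PGD, the attack used in PGD adversarial training can be sketched as repeated signed-gradient ascent on the loss followed by projection back into an L-infinity ball around the clean input. The sketch below applies it to a plain logistic-regression model, not a DLGN, so the input gradient can be written by hand. The model, the parameter values, and the function name are assumptions chosen for illustration, and the random start that is often used with PGD is omitted for determinism.

```python
import numpy as np

def pgd_linf(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """L-infinity PGD against p = sigmoid(w . x + b), with label y in {0, 1}."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
        grad = (p - y) * w                          # d(cross-entropy)/d(input)
        x_adv = x_adv + alpha * np.sign(grad)       # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project into the eps-ball
    return x_adv

# Toy example (hypothetical values).
x = np.array([1.0, -0.5])
w = np.array([2.0, -1.0])
b = 0.0
x_adv = pgd_linf(x, y=1, w=w, b=b)
print(x_adv, x @ w + b, x_adv @ w + b)
```

Adversarial training then fits the model on such perturbed inputs instead of, or alongside, the clean ones.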
Taxonomy
Methods: Principal Components Analysis
