Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability

Akshay G Rao; Chandrashekhar Lakshminarayanan; Arun Rajkumar

arXiv:2502.15017 · cs.LG · February 24, 2025

TL;DR

This paper uses Deep Linearly Gated Networks (DLGN) to interpret adversarially trained models. It shows how feature representations and gating patterns differ between robust and standard models, which helps explain how adversarial defences work.

Contribution

The paper applies the DLGN architecture to the interpretation of adversarially trained models and compares the internal properties of these models with those of standard-trained models.

Findings

01 PGD adversarial training aligns hyperplanes farther from data points.

02 PGD-AT models develop diverse, non-overlapping subnetworks across classes.

03 Visualizations reveal distinct representation patterns in robust models.
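
The first finding concerns how far hyperplanes sit from data points. This page does not state the exact metric the authors use, so the following is only a minimal sketch of the standard point-to-hyperplane distance, |w·x + b| / ‖w‖. It assumes each feature-network neuron defines a hyperplane w·x + b = 0 in input space. The function names and the arrays `X`, `W`, and `b` are hypothetical.

```python
import numpy as np

def hyperplane_distances(X, W, b):
    """Euclidean distance from every point to every hyperplane.

    X: (n, d) data points.
    W: (d, m) matrix whose columns are hyperplane normals (assumed layout).
    b: (m,) hyperplane offsets.
    Returns an (n, m) array of distances |w.x + b| / ||w||.
    """
    norms = np.linalg.norm(W, axis=0)
    return np.abs(X @ W + b) / norms

def min_distance_per_hyperplane(X, W, b):
    """Smallest distance from any data point to each hyperplane."""
    return hyperplane_distances(X, W, b).min(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))   # toy data, not from the paper
    W = rng.normal(size=(2, 4))     # four illustrative hyperplanes
    b = rng.normal(size=4)
    print(min_distance_per_hyperplane(X, W, b))
```

Under these assumptions, computing the same statistic for a standard model and a PGD-trained model would be one way to compare how close their hyperplanes come to the data.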

Abstract

Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. Adversarial training has been a popular defence technique against these adversarial attacks. In this work, we capitalize on a network architecture, namely Deep Linearly Gated Networks (DLGN), which has better interpretation capabilities than regular deep network architectures. Using this architecture, we interpret robust models trained using PGD adversarial training and compare them with standard training. Feature networks in DLGN act as feature extractors, making them the only medium through which an adversary can attack the model. We analyze the feature network of DLGN with fully connected layers with respect to properties like alignment of the hyperplanes, hyperplane relation with PCA, and sub-network overlap among classes and compare these properties…
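
The abstract describes the feature network as the only medium through which an adversary can attack the model. A minimal NumPy sketch of a DLGN-style forward pass can make that concrete. It is an illustration under stated assumptions, not the authors' implementation: the soft sigmoid gates with sharpness `beta`, the constant all-ones input to the value network, the layer widths, and the initialization are all assumptions, and `init_dlgn` and `dlgn_forward` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_dlgn(rng, in_dim, width, depth, n_classes):
    """Random weights for a toy fully connected DLGN (illustrative only)."""
    dims = [in_dim] + [width] * depth
    feat = [rng.normal(0.0, 1.0 / np.sqrt(dims[i]), (dims[i], dims[i + 1]))
            for i in range(depth)]
    val = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
           for _ in range(depth)]
    out = rng.normal(0.0, 1.0 / np.sqrt(width), (width, n_classes))
    return {"feat": feat, "val": val, "out": out}

def dlgn_forward(params, x, beta=4.0):
    """Forward pass: the input x enters only through the feature network."""
    h = x
    # Assumption: the value network receives a constant input of ones.
    v = np.ones((x.shape[0], params["val"][0].shape[0]))
    gates = []
    for Wf, Wv in zip(params["feat"], params["val"]):
        h = h @ Wf                 # feature network: linear, no activation
        g = sigmoid(beta * h)      # assumed soft gate derived from features
        v = (v @ Wv) * g           # value network layer, modulated by gates
        gates.append(g)
    logits = v @ params["out"]
    return logits, gates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_dlgn(rng, in_dim=4, width=8, depth=3, n_classes=3)
    logits, gates = dlgn_forward(params, rng.normal(size=(5, 4)))
    print(logits.shape, [g.shape for g in gates])
```

In this sketch, a perturbation of `x` can change the logits only by changing the gates, which is why the analysis of robustness focuses on the feature network.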

Taxonomy

Methods: Principal Components Analysis