Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Stephen Casper; Lennart Schulze; Oam Patel; Dylan Hadfield-Menell

arXiv:2403.05030 · cs.CR · July 30, 2025 · 2 cites

TL;DR

Latent adversarial training (LAT) enhances AI robustness by defending against unforeseen failure modes without prior knowledge of specific attacks, improving performance on both clean and adversarial data across multiple tasks.

Contribution

This work introduces LAT as a novel defense mechanism that leverages latent representations to protect against unknown vulnerabilities without requiring attack-specific data.

Findings

01

LAT improves robustness to unseen attacks.

02

LAT usually improves performance on clean data relative to standard adversarial training (AT).

03

LAT effectively defends against backdoors and novel adversarial attacks.

Abstract

Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness; however, they empirically struggle to fix failure modes that differ from the attacks used during training. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without leveraging knowledge of what they are or using inputs that elicit them. LAT makes use of the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. Here, we use it to defend against failure modes without examples that elicit them. Specifically, we use LAT…
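The core idea in the abstract, perturbing latent representations rather than inputs and then training under those perturbations, can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the two-layer tanh classifier, the single-step sign-gradient attack, the L-infinity budget `eps`, and all function names (`encode`, `latent_attack`, `lat_step`) are assumptions chosen for brevity, and gradients are written out by hand so that only the standard library is needed.

```python
import math
import random

def encode(x, W1):
    """Latent representation h = tanh(W1 x) of a toy two-layer model."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def logit(h, w2):
    """Output logit computed from a latent vector."""
    return sum(w * hi for w, hi in zip(w2, h))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(z, y):
    """Logistic loss for a label y in {0, 1}."""
    s = 1.0 if y == 1 else -1.0
    return math.log1p(math.exp(-s * z))

def latent_attack(h, w2, y, eps):
    """Single-step sign-gradient perturbation of the latent vector.

    Assumption: an L-infinity budget eps and one ascent step. The
    perturbation is applied to h, never to the input x.
    """
    g = sigmoid(logit(h, w2)) - y          # d loss / d logit
    grad_h = [g * w for w in w2]           # d loss / d h
    return [eps * ((gi > 0) - (gi < 0)) for gi in grad_h]

def lat_step(x, y, W1, w2, eps=0.1, lr=0.05):
    """One training step on the loss under the latent perturbation.

    The perturbation is treated as a constant when differentiating.
    Updates W1 and w2 in place; returns (clean_loss, adversarial_loss)
    measured before the update.
    """
    h = encode(x, W1)
    clean = loss(logit(h, w2), y)
    delta = latent_attack(h, w2, y, eps)
    h_adv = [hi + di for hi, di in zip(h, delta)]
    z_adv = logit(h_adv, w2)
    adv = loss(z_adv, y)
    g = sigmoid(z_adv) - y
    # Compute all gradients before touching any parameter.
    grad_w2 = [g * ha for ha in h_adv]
    grad_W1 = [[g * w2[i] * (1.0 - h[i] ** 2) * xj for xj in x]
               for i in range(len(h))]
    for i in range(len(w2)):
        w2[i] -= lr * grad_w2[i]
        for j in range(len(x)):
            W1[i][j] -= lr * grad_W1[i][j]
    return clean, adv

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical toy data: label is 1 when the first feature is larger.
    data = [([1.0, -1.0], 1), ([-1.0, 1.0], 0), ([0.5, -0.2], 1), ([-0.3, 0.8], 0)]
    W1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
    w2 = [random.uniform(-0.5, 0.5) for _ in range(3)]
    for epoch in range(50):
        results = [lat_step(x, y, W1, w2) for x, y in data]
    print("last-epoch clean/adversarial losses:", results)
```

Because the attack moves the latent vector in the direction that increases the loss, the adversarial loss returned by `lat_step` is never below the clean loss for this model. A fuller treatment would typically use several attack steps and pick which layer to perturb; those choices are left out of this sketch.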

Code & Models

Models

girishgupta/deep-ignorance-unfiltered_unlearned_lat
model · 16 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Fault Detection and Control Systems