Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell

TL;DR
Latent adversarial training (LAT) defends against unforeseen failure modes without prior knowledge of specific attacks, improving performance on both clean and adversarial data across multiple tasks.
Contribution
This work uses LAT as a defense mechanism that leverages the network's latent representations to protect against unknown vulnerabilities, without requiring knowledge of those vulnerabilities or inputs that elicit them.
Findings
LAT improves robustness to unseen attacks.
LAT enhances performance on clean data.
LAT effectively defends against backdoors and novel adversarial attacks.
Abstract
Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness; however, they empirically struggle to fix failure modes that differ from the attacks used during training. In this work, we utilize latent adversarial training (LAT) to defend against vulnerabilities without leveraging knowledge of what they are or using inputs that elicit them. LAT makes use of the compressed, abstract, and structured latent representations of concepts that the network actually uses for prediction. Here, we use it to defend against failure modes without examples that elicit them. Specifically, we use LAT…
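To make the idea of perturbing latent representations concrete, here is a minimal NumPy sketch of one possible LAT training step. It is not the authors' implementation. The toy two-layer network, the choice of the hidden layer as the attacked layer, the L2 budget `eps`, the step counts, and all function names (`latent_attack`, `lat_step`, `init_params`, `bce`) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    """Mean binary cross-entropy, clipped for numerical safety."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def init_params(d_in, d_hidden, rng):
    """Toy two-layer network: tanh hidden layer, logistic output (assumed)."""
    W1 = rng.normal(scale=0.5, size=(d_in, d_hidden))
    b1 = np.zeros(d_hidden)
    w2 = rng.normal(scale=0.5, size=d_hidden)
    return W1, b1, w2, 0.0

def latent_attack(h, y, w2, b2, eps, steps=5):
    """Projected gradient ascent on the loss w.r.t. a perturbation of the
    latent activations h, constrained to an L2 ball of radius eps."""
    delta = np.zeros_like(h)
    step_size = eps / 2.0  # assumed step size
    for _ in range(steps):
        p = sigmoid((h + delta) @ w2 + b2)
        # Per-example gradient of the cross-entropy w.r.t. delta.
        grad = (p - y)[:, None] * w2[None, :]
        grad_norm = np.linalg.norm(grad, axis=1, keepdims=True) + 1e-12
        delta = delta + step_size * grad / grad_norm
        # Project each row back onto the L2 ball of radius eps.
        norms = np.linalg.norm(delta, axis=1, keepdims=True)
        delta = delta * np.minimum(1.0, eps / (norms + 1e-12))
    return delta

def lat_step(params, x, y, eps=0.5, lr=0.1):
    """One training step: attack the latents, then descend on the
    resulting adversarial loss. Returns (new_params, adversarial_loss)."""
    W1, b1, w2, b2 = params
    h = np.tanh(x @ W1 + b1)
    delta = latent_attack(h, y, w2, b2, eps)  # treated as a constant below
    h_adv = h + delta
    p = sigmoid(h_adv @ w2 + b2)
    loss = bce(p, y)

    # Manual backpropagation through the perturbed forward pass.
    dz = (p - y) / len(y)
    g_w2 = h_adv.T @ dz
    g_b2 = dz.sum()
    dpre = (dz[:, None] * w2[None, :]) * (1.0 - h ** 2)
    g_W1 = x.T @ dpre
    g_b1 = dpre.sum(axis=0)

    new_params = (W1 - lr * g_W1, b1 - lr * g_b1, w2 - lr * g_w2, b2 - lr * g_b2)
    return new_params, loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(64, 4))
    y = (x[:, 0] + x[:, 1] > 0).astype(float)
    params = init_params(4, 8, rng)
    for _ in range(200):
        params, loss = lat_step(params, x, y)
    print("adversarial training loss:", loss)
```

The only difference from input-space adversarial training in this sketch is where the perturbation is applied: the adversary modifies the hidden activations `h` instead of the input `x`, so no input that elicits the failure is ever constructed.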
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Fault Detection and Control Systems
