The Persistent Vulnerability of Aligned AI Systems
Aengus Lynch

TL;DR
This thesis investigates why aligned AI systems remain vulnerable, introducing methods for understanding, removing, and testing for dangerous behaviors in autonomous models.
Contribution
It presents ACDC for automated circuit discovery and Latent Adversarial Training (LAT) for removing dangerous behaviors, and provides empirical analysis of model misbehavior and adversarial robustness.
Findings
ACDC recovers all five component types found by prior manual work on GPT-2 Small, in hours rather than months.
LAT removes dangerous behaviors, matching existing defenses with 700x fewer GPU hours.
High attack success rates demonstrate persistent vulnerabilities across models.
Abstract
Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking…
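The two-phase LAT loop described in the abstract (optimize a perturbation in latent space to elicit failure, then train under that perturbation) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the thesis's implementation: fixed 2-D vectors stand in for residual-stream activations, a logistic readout `v` stands in for the rest of the model, and every name here (`find_perturbation`, `lat_step`, `attacked_loss`, `eps`, the data) is hypothetical.

```python
import math

def sigmoid(z):
    # tanh form avoids overflow for large |z|
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(v, h, y):
    # Logistic loss softplus(z) - y*z in a numerically stable form
    z = dot(v, h)
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))

def find_perturbation(v, h, y, eps, steps=5):
    """Inner phase: projected gradient ascent on the loss w.r.t. a latent
    perturbation delta, constrained to an L2 ball of radius eps."""
    delta = [0.0] * len(h)
    for _ in range(steps):
        z = dot(v, [hi + di for hi, di in zip(h, delta)])
        grad = [(sigmoid(z) - y) * vi for vi in v]  # d loss / d delta
        norm = math.sqrt(dot(grad, grad))
        if norm == 0.0:
            break
        delta = [di + 0.5 * eps * gi / norm for di, gi in zip(delta, grad)]
        size = math.sqrt(dot(delta, delta))
        if size > eps:  # project back onto the eps-ball
            delta = [di * eps / size for di in delta]
    return delta

def attacked_loss(v, latents, labels, eps):
    """Mean loss when each latent is perturbed by its own adversarial delta."""
    total = 0.0
    for h, y in zip(latents, labels):
        delta = find_perturbation(v, h, y, eps)
        total += loss(v, [hi + di for hi, di in zip(h, delta)], y)
    return total / len(latents)

def lat_step(v, latents, labels, eps, lr):
    """Outer phase: one gradient-descent update of v on the attacked loss."""
    grad = [0.0] * len(v)
    for h, y in zip(latents, labels):
        delta = find_perturbation(v, h, y, eps)
        h_adv = [hi + di for hi, di in zip(h, delta)]
        err = sigmoid(dot(v, h_adv)) - y
        for j, hj in enumerate(h_adv):
            grad[j] += err * hj / len(latents)
    return [vi - lr * gi for vi, gi in zip(v, grad)]

# Toy data (made up for illustration): latent vectors and binary labels
latents = [[2.0, 0.5], [1.5, -0.5], [-2.0, 0.3], [-1.0, -0.4]]
labels = [1, 1, 0, 0]
v = [0.5, 0.5]

before = attacked_loss(v, latents, labels, eps=0.5)
for _ in range(100):
    v = lat_step(v, latents, labels, eps=0.5, lr=0.1)
after = attacked_loss(v, latents, labels, eps=0.5)
```

In a real transformer the perturbation would be added to residual-stream activations at a chosen layer and the model weights would be updated by backpropagation; the toy version keeps only the alternation between attack and training.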
