The Persistent Vulnerability of Aligned AI Systems

Aengus Lynch

arXiv:2604.00324·cs.LG·April 2, 2026

TL;DR

This thesis investigates why aligned AI systems remain vulnerable, and introduces methods for understanding, removing, and testing for dangerous behaviors in autonomous models.

Contribution

It presents ACDC for automated circuit discovery and Latent Adversarial Training (LAT) for removing dangerous behaviors, and provides empirical analysis of model misbehavior and adversarial robustness.

Findings

01. ACDC recovers all five component types found by prior manual work on GPT-2 Small, in hours rather than months.

02. LAT removes dangerous behaviors, matching existing defenses with 700x fewer GPU hours.

03. High attack success rates demonstrate persistent vulnerabilities across models.

Abstract

Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking…
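
The edge-selection idea behind automated circuit discovery can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the thesis implementation: the graph, the weights, the threshold `tau`, and the helper names `forward` and `prune` are all hypothetical, and the absolute difference in a scalar output stands in for whatever divergence metric the actual method uses. The loop visits edges from the output back toward the inputs, swaps each edge's signal for its value on a corrupted input, and leaves the edge out if the output barely changes.

```python
# Toy sketch of greedy, threshold-based edge pruning on a tiny linear graph.
# Everything here (graph, weights, tau, function names) is a hypothetical
# illustration, not code from the thesis.

def forward(edges, order, clean, corrupt, ablated):
    """Evaluate the graph on the clean input.

    Each non-input node is the weighted sum of its incoming edges. An ablated
    edge carries its source's activation from the corrupted run instead.
    `order` must list nodes in topological order, ending with the output node.
    """
    # Activations of the fully corrupted run, used as replacement values.
    corrupt_act = dict(corrupt)
    for node in order:
        if node not in corrupt_act:
            corrupt_act[node] = sum(
                w * corrupt_act[s] for (s, d), w in edges.items() if d == node
            )
    # Clean run with ablated edges patched in.
    act = dict(clean)
    for node in order:
        if node not in act:
            act[node] = sum(
                w * (corrupt_act[s] if (s, d) in ablated else act[s])
                for (s, d), w in edges.items()
                if d == node
            )
    return act[order[-1]]


def prune(edges, order, clean, corrupt, tau):
    """Return the edges that survive pruning at threshold tau."""
    full = forward(edges, order, clean, corrupt, set())
    ablated = set()
    current = 0.0  # deviation from the full model caused by edges ablated so far
    for node in reversed(order):  # output first, inputs last
        for edge in [e for e in edges if e[1] == node]:
            trial = ablated | {edge}
            metric = abs(forward(edges, order, clean, corrupt, trial) - full)
            if metric - current < tau:  # ablating this edge changes little
                ablated = trial
                current = metric
    return set(edges) - ablated


# Hypothetical four-edge graph: two edges matter, two are near-zero.
edges = {
    ("x1", "h"): 1.0,
    ("x2", "h"): 0.001,
    ("h", "out"): 2.0,
    ("x2", "out"): 0.0005,
}
order = ["x1", "x2", "h", "out"]
clean = {"x1": 1.0, "x2": 1.0}
corrupt = {"x1": 0.0, "x2": 0.0}

kept = prune(edges, order, clean, corrupt, tau=0.01)
print(sorted(kept))  # [('h', 'out'), ('x1', 'h')]
```

In this toy, raising `tau` prunes more aggressively and lowering it keeps more edges; picking a small subgraph out of a large candidate set is the same trade-off at scale.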
