Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Abhay Sheshadri; Aidan Ewart; Phillip Guo; Aengus Lynch; Cindy Wu; Vivek Hebbar; Henry Sleight; Asa Cooper Stickland; Ethan Perez; Dylan Hadfield-Menell; Stephen Casper

arXiv:2407.15549·cs.LG·July 30, 2025·3 cites

TL;DR

This paper introduces targeted latent adversarial training (LAT) to enhance the robustness of large language models against harmful behaviors like jailbreaks and backdoors, outperforming existing methods with less compute.

Contribution

The paper demonstrates that targeted LAT can defend against specific failure modes in LLMs, including jailbreaks, backdoors, and retained undesirable knowledge, at lower compute cost than existing methods.

Findings

01. Targeted LAT outperforms baseline methods in robustness to jailbreaks.

02. Targeted LAT removes backdoors more effectively, without requiring knowledge of the trigger.

03. Targeted LAT improves unlearning of undesirable tasks and makes the unlearning more robust.

Abstract

Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of 'jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about…
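The untargeted latent-space attack described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a two-weight scalar model (`w1`, `w2`), a squared-error loss, and hypothetical helper names (`latent_attack`, `lat_train`, `adversarial_loss`), all chosen only for illustration. The adversary perturbs the hidden activation `h = w1 * x` within a budget `eps` to maximize the loss, and the model is then updated to minimize the loss at that perturbed activation. The targeted variant named in the TL;DR is not reproduced here, since the abstract is truncated before it is defined.

```python
# Toy untargeted latent adversarial training (LAT) on a scalar two-layer
# linear model: h = w1 * x, y_hat = w2 * (h + delta).
# All names and hyperparameters are illustrative assumptions.

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def loss(w1, w2, x, y, delta=0.0):
    """Squared error with an optional perturbation added to the latent."""
    return (w2 * (w1 * x + delta) - y) ** 2

def latent_attack(w1, w2, x, y, eps, steps=4, step_size=0.05):
    """Projected sign-gradient ascent on delta, constrained to |delta| <= eps."""
    h = w1 * x
    delta = 0.0
    for _ in range(steps):
        err = w2 * (h + delta) - y
        grad = 2.0 * err * w2                  # d(loss)/d(delta)
        delta += step_size * sign(grad)        # ascent step: increase the loss
        delta = max(-eps, min(eps, delta))     # project back into the budget
    return delta

def lat_train(data, w1=1.0, w2=1.0, eps=0.1, lr=0.01, epochs=200):
    """Alternate the inner attack with an outer gradient-descent step."""
    for _ in range(epochs):
        for x, y in data:
            delta = latent_attack(w1, w2, x, y, eps)
            latent = w1 * x + delta
            err = w2 * latent - y
            g1 = 2.0 * err * w2 * x            # d(loss)/d(w1) at perturbed latent
            g2 = 2.0 * err * latent            # d(loss)/d(w2) at perturbed latent
            w1 -= lr * g1
            w2 -= lr * g2
    return w1, w2

def adversarial_loss(w1, w2, data, eps):
    """Mean loss when each example is attacked in latent space."""
    total = sum(loss(w1, w2, x, y, latent_attack(w1, w2, x, y, eps))
                for x, y in data)
    return total / len(data)

# Toy "desirable behavior": y = 2x.
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
```

Calling `lat_train(DATA)` and comparing `adversarial_loss` before and after training shows the intended effect in this toy setting: the loss under latent perturbation drops as the weights adapt to the attacked activations.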

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- I believe targeted LAT *can be* a useful attack-agnostic defense, although the current evaluation lacks depth (see below).
- The breadth of evaluation is appealing. It’s nice to see a method that potentially improves on safety/alignment across multiple diverse tasks.

Weaknesses

- The attacks used for the evaluation in the main table (Table 2) are quite weak: the best attack success rate is 27.7% on Llama-3-8B Instruct, although it’s possible to achieve ~100% ASR on this model (e.g., as reported in [Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks](https://arxiv.org/abs/2404.02151) but with a different judge). Without strong enough attacks, it’s hard to conclude that the defense is effective enough, especially given the anecdotal evidenc

Reviewer 02 · Rating 5 · Confidence 3

Strengths

- The paper introduces targeted Latent Adversarial Training (LAT). This computationally efficient approach enhances the robustness of LLMs by specifically targeting latent activations.
- Extensive experiments have been conducted to provide good insight into the components of the proposed method.
- The paper is generally well-written, with clear illustrations and tables.

Weaknesses

- This paper follows a general adversarial training pipeline, which requires maximizing the adversarial loss while minimizing the "safety loss." The framework itself is familiar for adversarial training, which might hinder the contribution of the paper.
- As the proposed method shares similarities to the latent adversarial training (LAT), the paper needs to discuss the difference between the proposed method and the previous LAT. In addition, as the LAT perturbed the layer's activation, choosing

Reviewer 03 · Rating 3 · Confidence 4

Strengths

(1) The t-LAT algorithm seems effective across a wide range of tasks and is flexible enough to be combined with many optimization objectives without adding much overhead.
(2) The authors provide necessary implementation guidelines such as adding additional SFT loss or KL divergence.

Weaknesses

**Major**
(1) Section 4.1: The attacks considered are not strong enough with most of them achieving ASR < 20% against the base model, making it questionable whether the proposed technique will bring improvement when faced with more advanced jailbreak attacks like [1], [2] and [3]. Also, both Llama-2 and Llama-3 are very safe models. I think the authors should experiment with weaker models like Vicuna-7B. An improvement of 2% in ASR is still somewhat marginal for me. (Also, I encourage the autho

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Recurrent Replay Distributed DQN