Probing the Robustness of Large Language Models Safety to Latent Perturbations

Tianle Gu; Kexin Huang; Zongqi Wang; Yixu Wang; Jie Li; Yuanqi Yao; Yang Yao; Yujiu Yang; Yan Teng; Yingchun Wang

arXiv:2506.16078·cs.LG·June 23, 2025

TL;DR

This paper investigates the vulnerability of safety-aligned large language models to latent space perturbations, introduces diagnostic tools and attack methods, and proposes a fine-tuning strategy to enhance robustness at the representation level.

Contribution

The paper presents an NLL-based probing method for measuring latent sensitivity, develops the Activation Steering Attack (ASA), and introduces Layer-wise Adversarial Patch Training (LAPT) to improve safety robustness.

Findings

01. LAPT enhances alignment robustness without losing capabilities.

02. Latent perturbations can trigger unsafe responses in aligned models.

03. Current surface-level alignment methods are insufficient for robustness.
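
Finding 02 can be made concrete with a toy linear readout. This is a minimal sketch under stated assumptions, not the paper's Activation Steering Attack: the weights `w`, bias `b`, hidden state `h`, and the two labels are hypothetical, and a real model's refusal behavior is not a single linear decision. The sketch only shows that a shift much smaller than the hidden state itself can flip a decision when the input is left unchanged.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decision(w, b, h):
    # Hypothetical readout: a positive score means "refuse", otherwise "comply".
    return "refuse" if dot(w, h) + b > 0 else "comply"

def steer(h, direction, alpha):
    # Add a scaled direction to the hidden state; the input itself is untouched.
    return [x + alpha * d for x, d in zip(h, direction)]

# Illustrative values only (assumptions, not taken from the paper).
w = [0.8, -0.6, 0.0]
b = 0.1
h = [1.0, 0.5, 2.0]

# Distance from h to the decision boundary, measured along -w.
margin = dot(w, h) + b
norm_w = math.sqrt(dot(w, w))
unit = [-x / norm_w for x in w]
min_alpha = margin / norm_w

# Step slightly past the boundary.
alpha = 1.1 * min_alpha
steered = steer(h, unit, alpha)

print("before:", decision(w, b, h))
print("after: ", decision(w, b, steered))
print("shift norm:", alpha, "hidden-state norm:", math.sqrt(dot(h, h)))
```

In this toy setting the most damaging direction is known in closed form. In a real model it is not, which is why a diagnostic signal for finding sensitive directions is useful.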

Abstract

Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective…
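
The probing idea described in the abstract can be illustrated on a toy model. The sketch below is not the authors' implementation: the linear readout `W`, the random hidden states, the single shared shift `delta`, and the random unit direction are all assumptions made for illustration. It fixes the model's own greedy response, adds a shift to the hidden states, and reports how much the mean Negative Log-Likelihood of that fixed response changes.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def readout(W, h):
    # Toy "language model head": logits are a linear map of the hidden state.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def nll_of_response(W, hidden_states, response_tokens, delta=None):
    """Mean NLL of a fixed response, optionally shifting every hidden state by delta."""
    total = 0.0
    for h, tok in zip(hidden_states, response_tokens):
        if delta is not None:
            h = [x + d for x, d in zip(h, delta)]
        probs = softmax(readout(W, h))
        total -= math.log(probs[tok])
    return total / len(response_tokens)

def unit_gaussian_direction(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

rng = random.Random(0)
dim, vocab, steps = 8, 5, 4
W = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab)]
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(steps)]

# The "original response" is the greedy token at each step of the unperturbed model.
original = [max(range(vocab), key=lambda t, h=h: readout(W, h)[t]) for h in hidden_states]
base_nll = nll_of_response(W, hidden_states, original)

# Probe: how does the NLL of the original response move under a latent shift?
direction = unit_gaussian_direction(dim, rng)
for scale in (0.5, 1.0, 2.0):
    delta = [scale * x for x in direction]
    shifted_nll = nll_of_response(W, hidden_states, original, delta)
    print(f"scale={scale}: NLL change = {shifted_nll - base_nll:+.4f}")
```

A larger NLL change for a given shift magnitude would suggest that the original response is more sensitive along that direction. In a real model the shift would be applied to a chosen layer's activations rather than to the input of a toy readout.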

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Overall, the paper studies an important problem and the systematic approach to measuring the robustness of models to perturbations of their internal representations appears to be a novel contribution. Their training methodology seems potentially useful as it uses different objectives from standard latent adversarial training. Overall, the paper seems like it may describe a novel latent attack methodology and the particular method of using a target compliant suffix to generate the attacks seems n

Weaknesses

The paper does not appropriately characterize itself with respect to the related literature and suffers from clarity issues about its contributions or evaluations. Within the main body of the paper, related work is discussed in a single paragraph and primarily provides a generic overview. Discussion of specific related work is delegated to an appendix. This makes it quite hard to evaluate the paper and creates a misleading perception of the work. A particular concern is that the paper does not

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The core method here is sound and worth knowing about. It seems realistic that malicious actors will want to steer open-source models to do tasks their developers didn't intend, so benchmarking and defending against this seems like a necessary and timely area for intervention. The core technical approach is sound and the experimental methodology is reasonable. The NLL-based probing method provides a principled way to identify vulnerable directions in latent space, and the normalization scheme (E

Weaknesses

While it's true that I'm not aware of other work exploring this axis of attack on models, I'd like the authors to explain more of why or when this would be a realistic threat model. This degree of whitebox access isn't common in commercial AI deployments, especially among models that top the leaderboards on various relevant axes. It's also unclear why attackers with whitebox access would use activation steering over other methods to make harmless models harmful. The methods here are also mostl

Reviewer 03 · Rating 2 · Confidence 3

Strengths

* The general area is pretty promising (using latent space attacks to find adversarial vulnerabilities and train them away)
* The particular approach seems like a reasonable approach to have tested

Weaknesses

* It seems obvious that latent adversarial attacks will work for eliciting harmful info from models (I think this has also been shown in prior work). I think it doesn’t have major implications to me as well, since models aren’t trained to be robust to latent space attacks. The existence of latent attacks doesn’t clearly imply anything about the model’s actual robustness on the real distribution of inputs (which is what matters)
* I like the motivation to train against these attacks though, rathe

Reviewer 04 · Rating 8 · Confidence 4

Strengths

S1: I think that NLL probing is clever, and I am glad that the authors got it to work as a useful proxy for the latent attackability of latent activations.
S2: I like the experiments to combine ASA with other attacks.
S3: Overall, I think that the work is clever and working on a good problem.

Weaknesses

W1: I think that claim number 1 at the end of the introduction is probably an over claim. I think this has been demonstrated before:
- https://arxiv.org/abs/2412.09565
- https://arxiv.org/abs/2312.02780
- https://arxiv.org/abs/2502.05209
- https://arxiv.org/abs/2403.05030
- https://arxiv.org/abs/2406.11717
W2: In Figure 2, INIT correlates with both MASR and PASR. This seems very important. It validates the hypothesis that vulnerability to ASA is indicative of weak alignment overall. You should

Code & Models

Repositories

carol-gutianle/latentsafety
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)

Methods: Focus