Probing the Robustness of Large Language Models Safety to Latent Perturbations
Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang

TL;DR
This paper investigates the vulnerability of safety-aligned large language models to latent space perturbations, introduces diagnostic tools and attack methods, and proposes a fine-tuning strategy to enhance robustness at the representation level.
Contribution
It presents a probing method for latent sensitivity, develops the Activation Steering Attack, and introduces Layer-wise Adversarial Patch Training to improve safety robustness.
Findings
Layer-wise Adversarial Patch Training (LAPT) enhances alignment robustness without losing capabilities.
Latent perturbations can trigger unsafe responses in aligned models.
Current surface-level alignment methods are insufficient for robustness.
Abstract
Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective…
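As a rough illustration of the probing idea in the abstract, the sketch below measures how the Negative Log-Likelihood of a model's original response changes when its hidden states are shifted along a direction. It is a toy under stated assumptions, not the authors' implementation: a random linear readout (`W_out`) stands in for the language model, the "original response" is taken as the greedy tokens of that toy readout, and the names `response_nll`, `nll_probe`, `direction`, and `scale` are hypothetical. The paper's actual layer choice and normalization are not given in this excerpt.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def response_nll(hidden, W_out, response_ids):
    """Mean NLL of the response tokens given per-position hidden states."""
    logp = log_softmax(hidden @ W_out)  # shape (T, V)
    picked = logp[np.arange(len(response_ids)), response_ids]
    return float(-picked.mean())

def nll_probe(hidden, W_out, direction, scale):
    """Change in NLL of the original response after a latent shift.

    Assumption: the toy model's original response is its greedy decoding
    from the unperturbed hidden states.
    """
    response_ids = (hidden @ W_out).argmax(axis=-1)
    unit = direction / np.linalg.norm(direction)
    base = response_nll(hidden, W_out, response_ids)
    shifted = response_nll(hidden + scale * unit, W_out, response_ids)
    return shifted - base

# Toy setup (all sizes and values are illustrative assumptions).
rng = np.random.default_rng(0)
T, d, V = 5, 16, 32
hidden = rng.normal(size=(T, d))
W_out = rng.normal(size=(d, V))
direction = rng.normal(size=d)

delta_zero = nll_probe(hidden, W_out, direction, 0.0)
delta_big = nll_probe(hidden, W_out, direction, 4.0)
print("NLL change with no shift:", delta_zero)
print("NLL change with shift 4.0:", delta_big)
```

In this framing, a direction that produces a large NLL change for a small `scale` would be a candidate "vulnerable direction"; with a real model the shift would be applied to a chosen layer's activations rather than to a toy readout.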
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Overall, the paper studies an important problem and the systematic approach to measuring the robustness of models to perturbations of their internal representations appears to be a novel contribution. Their training methodology seems potentially useful as it uses different objectives from standard latent adversarial training. Overall, the paper seems like it may describe a novel latent attack methodology and the particular method of using a target compliant suffix to generate the attacks seems n
The paper does not appropriately characterize itself with respect to the related literature and suffers from clarity issues about its contributions or evaluations. Within the main body of the paper, related work is discussed in a single paragraph and primarily provides a generic overview. Discussion of specific related work is delegated to an appendix. This makes it quite hard to evaluate the paper and creates a misleading perception of the work. A particular concern is that the paper does not
The core method here is sound and worth knowing about. It seems realistic that malicious actors will want to steer open-source models to do tasks their developers didn't intend, so benchmarking and defending against this seems like a necessary and timely area for intervention. The core technical approach is sound and the experimental methodology is reasonable. The NLL-based probing method provides a principled way to identify vulnerable directions in latent space, and the normalization scheme (E
While it's true that I'm not aware of other work exploring this axis of attack on models, I'd like the authors to explain more of why or when this would be a realistic threat model. This degree of whitebox access isn't common in commercial AI deployments, especially among models that top the leaderboards on various relevant axes. It's also unclear why attackers with whitebox access would use activation steering over other methods to make harmless models harmful. The methods here are also mostl
* The general area is pretty promising (using latent space attacks to find adversarial vulnerabilities and train them away)
* The particular approach seems like a reasonable approach to have tested
* It seems obvious that latent adversarial attacks will work for eliciting harmful info from models (I think this has also been shown in prior work). I think it doesn’t have major implications to me as well, since models aren’t trained to be robust to latent space attacks. The existence of latent attacks doesn’t clearly imply anything about the model’s actual robustness on the real distribution of inputs (which is what matters)
* I like the motivation to train against these attacks though, rathe
S1: I think that NLL probing is clever, and I am glad that the authors got it to work as a useful proxy for the latent attackability of latent activations.
S2: I like the experiments to combine ASA with other attacks.
S3: Overall, I think that the work is clever and working on a good problem.
W1: I think that claim number 1 at the end of the introduction is probably an over claim. I think this has been demonstrated before:
- https://arxiv.org/abs/2412.09565
- https://arxiv.org/abs/2312.02780
- https://arxiv.org/abs/2502.05209
- https://arxiv.org/abs/2403.05030
- https://arxiv.org/abs/2406.11717
W2: In Figure 2, INIT correlates with both MASR and PASR. This seems very important. It validates the hypothesis that vulnerability to ASA is indicative of weak alignment overall. You should
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
