Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation
Zhibo Zhang, Yuxi Li, Kailong Wang, Shuai Yuan, Ling Shi, Haoyu Wang

TL;DR
This paper introduces ETTA, a novel method to manipulate LLM embeddings to bypass safety measures, revealing vulnerabilities in current alignment strategies and emphasizing the need for embedding-aware defenses.
Contribution
The paper presents ETTA, a new framework for targeted toxicity attenuation in embedding space that does not require model fine-tuning or training data access.
Findings
ETTA achieves an average attack success rate of 88.61%.
ETTA outperforms baseline methods by 11.34%.
ETTA generalizes to safety-enhanced models with a 77.39% success rate.
Abstract
Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, their openness also introduces significant security risks, particularly through embedding space poisoning, a subtle attack vector in which adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
