Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

Zhibo Zhang; Yuxi Li; Kailong Wang; Shuai Yuan; Ling Shi; Haoyu Wang

arXiv:2507.08020 · cs.CL · July 14, 2025

TL;DR

This paper introduces ETTA, a novel method to manipulate LLM embeddings to bypass safety measures, revealing vulnerabilities in current alignment strategies and emphasizing the need for embedding-aware defenses.

Contribution

The paper presents ETTA, a new framework for targeted toxicity attenuation in embedding space that does not require model fine-tuning or training data access.

Findings

01 ETTA achieves an average attack success rate of 88.61%.

02 ETTA outperforms baseline methods by 11.34%.

03 ETTA generalizes to safety-enhanced models with a 77.39% success rate.

Abstract

Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, the openness of these models also introduces significant security risks, particularly through embedding space poisoning, a subtle attack vector in which adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling