Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models

Haoyu Liang; Youran Sun; Yunfeng Cai; Jun Zhu; Bo Zhang

arXiv:2501.18280·cs.CL·May 20, 2025

Open Access 3 Reviews

TL;DR

This paper identifies a bias in text embedding models that can be exploited using universal magic words to bypass safeguards in large language models, leading to potential security breaches, and proposes methods to defend against such attacks.

Contribution

It introduces a novel attack method using universal magic words to manipulate text embeddings and bypass safeguards, along with effective defense strategies to mitigate this security risk.

Findings

01

Magic words significantly degrade safeguard performance.

02

Attacks cause harmful outputs in real-world chatbots.

03

Defense methods effectively reduce embedding bias.

Abstract

The security issue of large language models (LLMs) has gained wide attention recently, with various defense mechanisms developed to prevent harmful output, among which safeguards based on text embedding models serve as a fundamental defense. Through testing, we discover that the output distribution of text embedding models is severely biased with a large mean. Inspired by this observation, we propose novel, efficient methods to search for **universal magic words** that attack text embedding models. Universal magic words as suffixes can shift the embedding of any text towards the bias direction, thus manipulating the similarity of any text pair and misleading safeguards. Attackers can jailbreak the safeguards by appending magic words to user prompts and requiring LLMs to end answers with magic words. Experiments show that magic word attacks significantly degrade safeguard performance on…
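The mechanism described in the abstract can be illustrated with a toy model. The sketch below is not the authors' method: `toy_embedding` is a hypothetical stand-in for a real embedding model (a shared bias vector plus Gaussian noise), and `shift_toward_bias` simply assumes that a magic-word suffix moves an embedding along the mean direction. Under those assumptions it shows that a large mean norm signals a biased distribution, and that pushing two unrelated embeddings along the bias direction raises their cosine similarity.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_embedding(rng, bias, noise=1.0):
    # Hypothetical stand-in for a text embedding model:
    # a shared bias vector plus per-text Gaussian noise, then unit-normalized.
    return normalize([b + noise * rng.gauss(0.0, 1.0) for b in bias])


def shift_toward_bias(e, bias_dir, strength):
    # Assumed model of a magic-word suffix: move the embedding
    # along the bias direction, then renormalize to unit length.
    return normalize([x + strength * d for x, d in zip(e, bias_dir)])


rng = random.Random(0)
dim = 64
bias = [0.5] * dim
embeddings = [toy_embedding(rng, bias) for _ in range(200)]

# A large mean norm indicates the embeddings cluster around one direction.
mean_vec = [sum(col) / len(embeddings) for col in zip(*embeddings)]
mean_norm = math.sqrt(dot(mean_vec, mean_vec))
bias_dir = normalize(mean_vec)

a, b = embeddings[0], embeddings[1]
sim_before = dot(a, b)  # cosine similarity, since both are unit vectors
sim_after = dot(shift_toward_bias(a, bias_dir, 2.0),
                shift_toward_bias(b, bias_dir, 2.0))

print(f"mean norm: {mean_norm:.3f}")
print(f"cosine before: {sim_before:.3f}, after: {sim_after:.3f}")
```

A safeguard that thresholds on cosine similarity would see both shifted texts as closer to each other, and to everything else near the bias direction, which is the kind of manipulation the abstract describes.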

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

The discovery and empirical validation of the non-uniform, biased distribution of text embeddings (Fig. 1) is a significant and insightful contribution. It provides a principled and elegant explanation for the existence of universal adversarial attacks, moving beyond simple heuristics. This observation is itself of high value to representation learning. The paper is well written and the finding on embeddings is interesting. The authors also propose several methods motivated by this finding.

Weaknesses

* Does correcting the bias harm the embedding model's performance on its primary tasks (e.g., semantic search, classification)? An empirical evaluation is necessary. The safeguard-bypass setting may also have limited real-world applicability.
* The final step of Alg. 3 involves a Cartesian product of candidate tokens, which can lead to a combinatorial explosion. The practical limits on the magic word length and candidate size should be discussed.
* The defense method has not been t
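To make the combinatorial concern concrete, here is a small arithmetic sketch. It does not reproduce Alg. 3; it only assumes that the final step enumerates the Cartesian product of `k` candidate tokens at each of `m` positions, so the search space has `k ** m` entries. The values of `k` and `m` are hypothetical.

```python
from itertools import islice, product


def num_combinations(candidates_per_position, length):
    # Size of the Cartesian product: k choices at each of m positions.
    return candidates_per_position ** length


def first_combinations(candidates, length, limit):
    # Lazily enumerate the product and keep at most `limit` tuples.
    return list(islice(product(candidates, repeat=length), limit))


# Hypothetical setting: 20 candidate tokens per position.
sizes = {m: num_combinations(20, m) for m in (1, 2, 3, 4, 5)}
for m, size in sizes.items():
    print(f"length {m}: {size} combinations")

sample = first_combinations(["a", "b", "c"], 2, 100)
```

With 20 candidates per position, the count grows from 8,000 at length 3 to 3,200,000 at length 5, which is why bounds on length and candidate size matter.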

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* This paper proposes a bias-direction analysis for text-embedding models, which is new.
* This paper offers a simple, train-free mitigation for the proposed attack.

Weaknesses

* The use of a "magic" suffix has been proposed in other works (like GCG), making the contribution of this paper somewhat incremental.
* Using renormalization for defense is promising, but its impact on diverse downstream retrieval/semantic tasks (beyond the reported classifiers) remains underexplored.
* The paper lacks a head-to-head experimental comparison with other white-box attacks.
* It shares some inherited limitations of white-box attacks.
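For readers unfamiliar with the renormalization idea mentioned in this review, the following is a minimal sketch under one assumed reading of it: subtract an estimated mean embedding, then rescale to unit length. The paper's exact procedure is not shown on this page, so the function `renormalize` and the toy data are illustrative assumptions only.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    n = norm(v)
    return [x / n for x in v]


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def renormalize(e, mean_vec):
    # Assumed defense: remove the estimated mean, then rescale to unit length.
    return normalize([x - m for x, m in zip(e, mean_vec)])


rng = random.Random(0)
# Toy biased embeddings: a constant offset plus Gaussian noise (hypothetical data).
raw = [normalize([0.5 + rng.gauss(0.0, 1.0) for _ in range(64)])
       for _ in range(200)]

mean_vec = mean_vector(raw)
corrected = [renormalize(e, mean_vec) for e in raw]

bias_before = norm(mean_vec)
bias_after = norm(mean_vector(corrected))
print(f"mean norm before: {bias_before:.3f}, after: {bias_after:.3f}")
```

Whether such a correction preserves retrieval or classification quality on real models is exactly the open question the reviewer raises; the toy example says nothing about that.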

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper studies the vulnerability from an interesting angle by finding universal magic words.
2. The paper shows strong results for both the attack and the defense strategy.
3. The paper is well motivated and well written.

Weaknesses

1. There is no analysis of how many magic words may exist in a model.
2. The influence of repetition count, token length, and embedding normalization choices is not systematically analyzed.
3. There is no analysis of the randomness in learning magic words across different random seeds, etc.
4. There is no discussion of the origin of, or insights behind, the identified magic words.

Taxonomy

Topics: Digital and Cyber Forensics · Artificial Intelligence in Law · Digital Rights Management and Security

Methods: Softmax · Attention Is All You Need