In-Context Representation Hijacking
Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman

TL;DR
This paper presents Doublespeak, an attack that hijacks LLM representations by substituting benign tokens for harmful ones, enabling unsafe prompts to bypass safety measures with high success rates across various models.
Contribution
It introduces a novel, transferability-focused attack method that manipulates internal representations of LLMs without optimization, exposing vulnerabilities in current safety alignment strategies.
Findings
Achieves a 74% attack success rate (ASR) on Llama-3.3-70B-Instruct with a single prompt override
Representation convergence from benign to harmful occurs layer by layer
Broadly transferable across different LLM architectures
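The layer-by-layer convergence finding could be checked numerically by tracking, per layer, the cosine similarity between the substitute token's hidden state and a reference hidden state for the original word. The following is a minimal sketch under stated assumptions, not the paper's own analysis: the hidden states are assumed to be already extracted as one vector per layer, the function names `cosine` and `similarity_trajectory` are hypothetical, and the vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_trajectory(substitute_states, reference_states):
    """Per-layer cosine similarity between two lists of hidden-state vectors.

    Assumption: both lists hold one vector per layer, in layer order.
    """
    if len(substitute_states) != len(reference_states):
        raise ValueError("need one vector per layer for both tokens")
    return [cosine(s, r) for s, r in zip(substitute_states, reference_states)]


# Toy 3-layer example (synthetic vectors, not real model activations).
substitute = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
reference = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
trajectory = similarity_trajectory(substitute, reference)
print(trajectory)
```

In this toy data the similarity rises monotonically across layers, which is the shape the convergence claim would predict. Real activations would need averaging over many prompts and tokens before drawing any conclusion.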
Abstract
We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Introduces a genuinely new representation-level jailbreak mechanism, distinct from prior token or prompt-level attacks.
- Demonstrates high attack success rates across both open and closed models with no optimization or fine-tuning.
- Provides clear mechanistic evidence via logit lens and Patchscopes showing layerwise semantic drift from benign to harmful meanings.

- The use of logit lens and Patchscopes is primarily qualitative. For instance, Figure 2 and Table 1 show probability or decoding trends across layers, but the paper never quantifies the variance across different runs, tokens, or sentences. The claim that “benign semantics in early layers converge to harmful semantics in later ones” would be more convincing with metrics such as cosine similarity trajectories or KL divergence between representation distributions.
- The attack’s success seems con
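One of the metrics this review asks for, KL divergence between per-layer distributions, can be sketched as follows. This is an illustrative sketch only: it assumes each layer's hidden state has already been projected to vocabulary logits (as a logit-lens style readout would do), the names `softmax`, `kl_divergence`, and `layerwise_kl` are hypothetical helpers, and the logits are toy values over a two-token vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    eps guards against log(0); it is an assumption of this sketch.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def layerwise_kl(logits_a, logits_b):
    """KL divergence per layer between two sequences of per-layer logits."""
    if len(logits_a) != len(logits_b):
        raise ValueError("need the same number of layers for both inputs")
    return [kl_divergence(softmax(a), softmax(b)) for a, b in zip(logits_a, logits_b)]


# Toy example: the two readouts disagree at layer 0 and agree at layer 1.
logits_substitute = [[2.0, 0.0], [0.0, 0.0]]
logits_reference = [[0.0, 2.0], [0.0, 0.0]]
kl_per_layer = layerwise_kl(logits_substitute, logits_reference)
print(kl_per_layer)
```

A divergence that shrinks toward zero in later layers would be consistent with the convergence claim. Reporting its mean and variance across prompts would address the reviewer's point about missing quantification.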
* I like the proposed approach, especially the optimisation-free aspect.
* Studying the representation "mutation" within the model as a potential safety risk is interesting.
## Major
* **Fundamental novelty/Scalability** I am not sure how this approach is fundamentally different from ad-hoc approaches where the attacker manually rephrases prompts to lure the model.
* **Unclear Methodology** The experimental setup is unclear, and the authors did not provide code to check it. The authors mention that they used HarmBench in their experiments, but they did not explain how they automated or scaled their approach. In harmful prompts (from HarmBench) like: "Describe how to
- Timely Topic
- Interesting Idea

- Unclear Description
- Insufficient Evaluation and Comparison
(1) Simple attack idea with wide practical implications: The Doublespeak recipe is easy to describe, requires no optimization or model fine-tuning, and transfers across model families and closed APIs, which makes the attack noteworthy from a safety perspective.
(2) Mechanistic interpretability evidence: The Patchscopes + logit-lens analyses provide plausible layerwise evidence that the benign token’s representations progressively shift toward harmful semantics in later layers. This helps explai
While the authors do explicitly flag the following two limitations in the paper, I want to emphasize that these are not just theoretical edge cases; they are important design considerations that should be evaluated before the attack is framed as broadly applicable or production-relevant. A convincing universal jailbreak should demonstrate robustness to these more realistic and diverse threat scenarios.
(1) Insufficient systematic ablations over substitute choice and semantics: The attack currentl
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Misinformation and Its Impacts
