In-Context Representation Hijacking

Itay Yona; Amir Sarid; Michael Karasik; Yossi Gandelsman

arXiv:2512.03771·cs.CL·December 5, 2025

Open Access · 4 Reviews

TL;DR

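This paper presents Doublespeak, an in-context attack that replaces a harmful keyword with a benign token across several in-context examples. The benign token's internal representation then takes on the harmful meaning, so superficially innocuous prompts bypass safety alignment; the paper reports 74% ASR on Llama-3.3-70B-Instruct and transfer across model architectures.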

Contribution

It introduces an optimization-free attack that manipulates LLMs' internal representations purely through in-context examples and transfers across models, exposing a vulnerability in current safety alignment strategies.

Findings

01. Achieves 74% ASR on Llama-3.3-70B-Instruct with a single prompt override

02. Representation convergence from benign to harmful occurs layer by layer

03. Broadly transferable across different LLM architectures

Abstract

We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- Introduces a genuinely new representation-level jailbreak mechanism, distinct from prior token or prompt-level attacks.
- Demonstrates high attack success rates across both open and closed models with no optimization or fine-tuning.
- Provides clear mechanistic evidence via logit lens and Patchscopes showing layerwise semantic drift from benign to harmful meanings.

Weaknesses

- The use of logit lens and Patchscopes is primarily qualitative. For instance, Figure 2 and Table 1 show probability or decoding trends across layers, but the paper never quantifies the variance across different runs, tokens, or sentences. The claim that “benign semantics in early layers converge to harmful semantics in later ones” would be more convincing with metrics such as cosine similarity trajectories or KL divergence between representation distributions.
- The attack’s success seems con
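As a rough illustration of the kind of quantitative metric this review asks for, the sketch below computes a per-layer cosine-similarity trajectory and a KL divergence between two distributions. It is not taken from the paper: the function names `cosine_trajectory` and `kl_divergence` are hypothetical, and the hidden states are synthetic NumPy stand-ins rather than activations extracted from any model.

```python
import numpy as np


def cosine_trajectory(states_a, states_b):
    """Per-layer cosine similarity between two (num_layers, hidden_dim) arrays."""
    a = np.asarray(states_a, dtype=float)
    b = np.asarray(states_b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("expected two arrays of shape (num_layers, hidden_dim)")
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Guard against division by zero for all-zero rows.
    return dots / np.clip(norms, 1e-12, None)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# Synthetic stand-in data (an assumption for illustration only):
# the "substitute" state is interpolated from an unrelated vector
# toward the "target" state as depth increases.
rng = np.random.default_rng(0)
num_layers, hidden_dim = 8, 64
target = rng.normal(size=(num_layers, hidden_dim))
unrelated = rng.normal(size=(num_layers, hidden_dim))
mix = np.linspace(0.0, 1.0, num_layers)[:, None]
substitute = (1.0 - mix) * unrelated + mix * target

trajectory = cosine_trajectory(substitute, target)
print(np.round(trajectory, 3))
```

In a real analysis the two arrays would hold one token's hidden state at each layer under two conditions, and the trajectory would be reported as a mean and variance over many prompts, which is the gap the review points to.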

Reviewer 02 · Rating 2 · Confidence 4

Strengths

* I like the proposed approach, especially the optimisation-free aspect.
* Studying the representation "mutation" within the model as a potential safety risk is interesting.

Weaknesses

## Major

* **Fundamental novelty/Scalability** I am not sure how this approach is fundamentally different from ad-hoc approaches where the attacker rephrases prompts manually to lure the model.
* **Unclear Methodology** The experimental setup is unclear, and the authors did not provide the code to check. The authors mention they used HarmBench in their experiments, but they did not explain how they automated/scaled their approach. In harmful prompts (from HarmBench) like: "Describe how to

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- Timely Topic
- Interesting Idea

Weaknesses

- Unclear Description
- Insufficient Evaluation and Comparison

Reviewer 04 · Rating 4 · Confidence 4

Strengths

(1) Simple attack idea with wide practical implications: The Doublespeak recipe is easy to describe, requires no optimization or model fine-tuning, and transfers across model families and closed APIs; this makes the attack noteworthy from a safety perspective.

(2) Mechanistic interpretability evidence: The Patchscopes + logit-lens analyses provide plausible layerwise evidence that the benign token’s representations progressively shift toward harmful semantics in later layers. This helps explai

Weaknesses

While the authors do explicitly flag the following 2 limitations in the paper, I want to emphasize that these are not just theoretical edge cases; they are important design considerations that should be evaluated before the attack is framed as broadly applicable or production-relevant. A convincing universal jailbreak should demonstrate robustness to these more realistic and diverse threat scenarios.

(1) Insufficient systematic ablations over substitute choice and semantics: The attack currentl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Misinformation and Its Impacts