Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao

TL;DR
This paper reveals a vulnerability in large language models in which earlier responses in a dialogue can be exploited to induce policy-violating outputs. It introduces an attack method, Response Attack, that the authors report outperforms existing jailbreak techniques.
Contribution
The paper introduces Response Attack, a novel framework that leverages intermediate responses to effectively jailbreak LLMs, demonstrating higher success rates than existing methods.
Findings
Response Attack achieves higher success rates than baseline methods.
Intermediate responses significantly influence model outputs.
The attack maintains stealth and efficiency across multiple LLMs.
Abstract
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in a dialogue can steer the model's subsequent behavior toward policy-violating content. Existing jailbreak attacks largely rely on single-turn or multi-turn prompt manipulations, or inject static in-context examples; these methods suffer from limited effectiveness, inefficiency, or semantic drift. We introduce Response Attack (RA), a novel framework that strategically leverages intermediate, mildly harmful responses as contextual primers within a dialogue. By reformulating harmful queries and injecting these intermediate responses before issuing a targeted trigger prompt, RA exploits a previously overlooked vulnerability in LLMs. Extensive experiments across eight…
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
