Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message
Wei Duan, Li Qian

TL;DR
This paper uncovers a new vulnerability in conversational multimodal models where adversaries forge past messages to bypass safety measures, demonstrating a significant security flaw in current AI systems.
Contribution
It introduces Trojan Horse Prompting, a novel attack method exploiting the model's trust in its conversational history, revealing a critical security weakness in modern AI safety alignment.
Findings
Trojan Horse Prompting achieves higher attack success rates than previous methods.
Models are vulnerable due to asymmetric safety alignment, trusting their own history.
Current safety mechanisms are insufficient against forged conversational context.
Abstract
The rise of conversational interfaces has greatly enhanced LLM usability by leveraging dialogue history for sophisticated reasoning. However, this reliance introduces an unexplored attack surface. This paper introduces Trojan Horse Prompting, a novel jailbreak technique. Adversaries bypass safety mechanisms by forging the model's own past utterances within the conversational history provided to its API. A malicious payload is injected into a model-attributed message, followed by a benign user prompt to trigger harmful content generation. This vulnerability stems from Asymmetric Safety Alignment: models are extensively trained to refuse harmful user requests but lack comparable skepticism towards their own purported conversational history. This implicit trust in its "past" creates a high-impact vulnerability. Experimental validation on Google's Gemini-2.0-flash-preview-image-generation…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
