Jailbreaking Frontier Foundation Models Through Intention Deception
Xinhe Wang, Katia Sycara, Yaqi Xie

TL;DR
This paper introduces a multi-turn jailbreaking method that exploits vulnerabilities in the safe-completion behavior of frontier models. It reveals a phenomenon the authors call para-jailbreaking and reports outperforming state-of-the-art models in multimodal settings.
Contribution
It presents a novel multi-turn attack exploiting model trust and introduces para-jailbreaking, uncovering new vulnerabilities in frontier models like GPT-5 and Claude.
Findings
High success rates against GPT-5 and Claude models
Uncovered and addressed para-jailbreaking vulnerabilities
Outperformed state-of-the-art multimodal models
Abstract
Large (vision-)language models exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe requests, based on the user's intent. This binary training regime often leads to brittleness, since user intent cannot be evaluated reliably, especially when an attacker obfuscates it, and it also makes the system seem unhelpful. In response, frontier models such as GPT-5 have shifted from refusal-based safeguards to safe completion, which aims to maximize helpfulness while obeying safety constraints. However, safe completion can be exploited when a user pretends that their intention is benign. Specifically, this intent inversion would be effective in multi-turn conversation, where the attacker has multiple opportunities to reinforce their deceptively…
