Jailbreaking Frontier Foundation Models Through Intention Deception

Xinhe Wang; Katia Sycara; Yaqi Xie

arXiv:2604.24082·cs.CR·April 28, 2026

TL;DR

This paper introduces a multi-turn jailbreaking method that exploits weaknesses in frontier models' safe-completion behavior, identifies a phenomenon it calls para-jailbreaking, and reports outperforming the state of the art in multimodal settings.

Contribution

The paper presents a novel multi-turn attack that exploits model trust and introduces para-jailbreaking, uncovering new vulnerabilities in frontier models such as GPT-5 and Claude.

Findings

01 High success rates against GPT-5 and Claude models

02 Uncovered and addressed para-jailbreaking vulnerabilities

03 Outperformed state-of-the-art multimodal models

Abstract

Large (vision-)language models exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe requests, based on the user's intent. This binary training regime has been found to be brittle, since user intent cannot be reliably evaluated, especially when an attacker obfuscates it, and it also makes the system seem unhelpful. In response, frontier models such as GPT-5 have shifted from refusal-based safeguards to safe completion, which aims to maximize helpfulness while obeying safety constraints. However, safe completion could be exploited when a user pretends that their intention is benign. Specifically, this intent inversion would be effective in multi-turn conversation, where the attacker has multiple opportunities to reinforce their deceptively…
