BreakFun: Jailbreaking LLMs via Schema Exploitation
Amirkia Rafiei Oskooei, Mehmet S. Aktas

TL;DR
This paper introduces BreakFun, a schema-based jailbreak method exploiting LLMs' adherence to structured data, demonstrating high success rates and proposing a guardrail defense to mitigate such vulnerabilities.
Contribution
We present a novel schema exploitation attack on LLMs and a defense mechanism using adversarial prompt deconstruction to enhance model robustness.
Findings
BreakFun achieves up to 100% success rate on several models.
The Trojan Schema is identified as the primary causal factor.
The proposed guardrail effectively detects and mitigates schema-based attacks.
Abstract
The proficiency of Large Language Models (LLMs) in processing structured data and adhering to syntactic rules is a capability that drives their widespread adoption but also makes them paradoxically vulnerable. In this paper, we investigate this vulnerability through BreakFun, a jailbreak methodology that weaponizes an LLM's adherence to structured schemas. BreakFun employs a three-part prompt that combines an innocent framing and a Chain-of-Thought distraction with a core "Trojan Schema"--a carefully crafted data structure that compels the model to generate harmful content, exploiting the LLM's strong tendency to follow structures and schemas. We demonstrate this vulnerability is highly transferable, achieving an average success rate of 89% across 13 foundational and proprietary models on JailbreakBench, and reaching a 100% Attack Success Rate (ASR) on several prominent models. A…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques
