Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models
Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari

TL;DR
This paper introduces Jigsaw Puzzles, a multi-turn jailbreak method that effectively bypasses safeguards in large language models by splitting harmful questions into harmless parts, revealing vulnerabilities in current defenses.
Contribution
The paper proposes a novel multi-turn jailbreak strategy called Jigsaw Puzzles that significantly improves attack success rates against advanced LLMs and exposes weaknesses in existing safety measures.
Findings
Jigsaw Puzzles (JSP) achieves an average attack success rate of over 93% on 189 harmful queries.
On GPT-4, JSP reaches a 92% attack success rate, surpassing previously reported attacks.
JSP demonstrates strong resistance to current defense strategies.
Abstract
Large language models (LLMs) have exhibited outstanding performance in engaging with humans and addressing complex questions by leveraging their vast implicit knowledge and robust reasoning capabilities. However, such models are vulnerable to jailbreak attacks, which can lead to the generation of harmful responses. Although recent research on single-turn jailbreak strategies has facilitated the development of defence mechanisms, the challenge of revealing vulnerabilities in the multi-turn setting remains relatively under-explored. In this work, we propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against advanced LLMs. JSP splits questions into harmless fractions as the input of each turn, and requests LLMs to reconstruct and respond to the questions over a multi-turn interaction. Our experimental results demonstrate that the proposed JSP jailbreak bypasses…
Taxonomy
Topics: Forensic and Genetic Research · Digital Media Forensic Detection · Jury Decision Making Processes
Methods: Attention Is All You Need · Dropout · Layer Normalization · Adam · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Absolute Position Encodings
