Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models

Hao Yang; Lizhen Qu; Ehsan Shareghi; Gholamreza Haffari

arXiv:2410.11459·cs.CL·October 16, 2024

TL;DR

This paper introduces Jigsaw Puzzles, a multi-turn jailbreak method that effectively bypasses safeguards in large language models by splitting harmful questions into harmless parts, revealing vulnerabilities in current defenses.

Contribution

The paper proposes a novel multi-turn jailbreak strategy called Jigsaw Puzzles that significantly improves attack success rates against advanced LLMs and exposes weaknesses in existing safety measures.

Findings

01. Jigsaw Puzzles (JSP) achieves an attack success rate of over 93% on 189 harmful queries.

02. JSP surpasses previously reported attack success rates, reaching 92% on GPT-4.

03. JSP remains effective against current defense strategies.
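
The percentages above are attack success rates. As a minimal sketch of how such a figure is commonly computed, assuming one binary success judgement per query: the paper's exact judging protocol is not described in this summary, and the function name `attack_success_rate` and the toy labels are hypothetical, not taken from the paper or its repository.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Return the fraction of queries judged as successful attacks.

    Assumption: each query contributes exactly one True/False judgement.
    """
    if not judgements:
        raise ValueError("need at least one judged query")
    # True counts as 1 and False as 0, so the sum is the number of successes.
    return sum(judgements) / len(judgements)


# Hypothetical labels for a toy set of 8 queries, not data from the paper.
toy_judgements = [True, True, False, True, True, True, False, True]
toy_asr = attack_success_rate(toy_judgements)
print(f"ASR = {toy_asr:.1%}")
```

Under this definition, a rate reported over 189 queries is simply the number of queries judged successful divided by 189; an average over several models would be the mean of the per-model rates.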

Abstract

Large language models (LLMs) have exhibited outstanding performance in engaging with humans and addressing complex questions by leveraging their vast implicit knowledge and robust reasoning capabilities. However, such models are vulnerable to jailbreak attacks, which lead them to generate harmful responses. Although recent research on single-turn jailbreak strategies has facilitated the development of defence mechanisms, the challenge of revealing vulnerabilities in the multi-turn setting remains relatively under-explored. In this work, we propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against advanced LLMs. JSP splits questions into harmless fractions, provides one fraction as the input of each turn, and requests that the LLM reconstruct and respond to the question over the multi-turn interaction. Our experimental results demonstrate that the proposed JSP jailbreak bypasses…

Code & Models

Repositories

yanghao97/jigsawpuzzles (official)

Taxonomy

Topics: Forensic and Genetic Research · Digital Media Forensic Detection · Jury Decision Making Processes

Methods: Attention Is All You Need · Dropout · Layer Normalization · Adam · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Absolute Position Encodings