PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
Yelim Ahn, Jaejin Lee

TL;DR
PUZZLED is a jailbreak technique that masks keywords in a harmful instruction as word puzzles, exploiting an LLM's reasoning ability to bypass safety measures. It reports an average attack success rate of 88.8%.
Contribution
The paper introduces a puzzle-based approach to jailbreaking LLMs and demonstrates its effectiveness across five state-of-the-art models.
Findings
Achieves an average attack success rate of 88.8%
Reaches a 96.5% attack success rate on GPT-4.1
Remains effective across five different LLMs
Abstract
As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, studies on jailbreak attacks have been actively growing. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM's reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types (word search, anagram, and crossword) that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate responses to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
