PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

Yelim Ahn; Jaejin Lee

arXiv:2508.01306·cs.AI·August 5, 2025

Open Access

TL;DR

PUZZLED is a jailbreak technique that masks keywords in a harmful instruction and presents them as word puzzles, exploiting LLMs' reasoning ability to bypass safety measures. It reaches an average attack success rate of 88.8% across five LLMs.

Contribution

The paper introduces a puzzle-based approach to jailbreaking LLMs and demonstrates its effectiveness on five state-of-the-art models.

Findings

01 Achieves an average attack success rate of 88.8%

02 Reaches an attack success rate of 96.5% on GPT-4.1

03 Effective across five different LLMs

Abstract

As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, studies on jailbreak attacks have been actively growing. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM's reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types (word search, anagram, and crossword) that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate responses to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)