PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs
Xueluan Gong, Mingzhe Li, Yilin Zhang, Fengyuan Ran, Chen Chen, Yanjiao Chen, Qian Wang, Kwok-Yan Lam

TL;DR
PAPILLON is an automated, black-box fuzz testing framework that efficiently generates semantically coherent jailbreak prompts for LLMs, achieving high success rates with reduced prompt length and robustness against defenses.
Contribution
It introduces a novel fuzz testing-based approach to jailbreaking LLMs that removes the reliance on manually crafted templates and uses question-dependent mutations to produce effective, stealthy attacks.
Findings
Achieves a success rate of over 90% on GPT-3.5 Turbo and GPT-4.
Reduces jailbreak prompt length significantly while maintaining high semantic coherence.
Demonstrates transferability and robustness against defenses.
Abstract
Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks, where attackers create jailbreak prompts to mislead the model to produce harmful or offensive content. Current jailbreak methods either rely heavily on manually crafted templates, which pose challenges in scalability and adaptability, or struggle to generate semantically coherent prompts, making them easy to detect. Additionally, most existing approaches involve lengthy prompts, leading to higher query costs. In this paper, to remedy these challenges, we introduce a novel jailbreaking attack framework called PAPILLON, which is an automated, black-box jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Instead of relying on manually crafted templates, PAPILLON starts with an empty seed pool, removing the need to search…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Position-Wise Feed-Forward Layer · Label Smoothing · Linear Warmup With Cosine Annealing
