TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
Aman Goel, Xian Carrie Wu, Zhe Wang, Dmitriy Bespalov, Yanjun Qi

TL;DR
TurboFuzzLLM is a mutation-based fuzzing method that efficiently generates jailbreaking prompts for large language models, achieving high success rates and improving model defenses against prompt-based attacks.
Contribution
The paper introduces TurboFuzzLLM, a novel mutation-based fuzzing technique with functional and efficiency improvements for automatic jailbreaking template generation.
Findings
Achieves ≥95% attack success rate on GPT-4 and GPT-4 Turbo.
Demonstrates strong generalizability to unseen harmful questions.
Helps improve defenses against prompt attacks.
Abstract
Jailbreaking large-language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attacking techniques in practice, and present functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves 95\% attack success rates (ASR) on public datasets for leading LLMs (including GPT-4o \& GPT-4 Turbo), shows…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
MethodsAbsolute Position Encodings · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer
