TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice

Aman Goel; Xian Carrie Wu; Zhe Wang; Dmitriy Bespalov; Yanjun Qi

arXiv:2502.18504·cs.CR·June 6, 2025

TL;DR

TurboFuzzLLM is a mutation-based fuzzing method that efficiently generates jailbreaking prompts for large language models, achieving high success rates and improving model defenses against prompt-based attacks.

Contribution

The paper introduces TurboFuzzLLM, a novel mutation-based fuzzing technique with functional and efficiency improvements for automatic jailbreaking template generation.

Findings

01. Achieves ≥95% attack success rate on GPT-4o and GPT-4 Turbo.

02. Demonstrates strong generalizability to unseen harmful questions.

03. Helps improve defenses against prompt attacks.

Abstract

Jailbreaking large-language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attacking techniques in practice, and present functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves ≥95% attack success rates (ASR) on public datasets for leading LLMs (including GPT-4o and GPT-4 Turbo), shows…
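The ASR figure quoted in the abstract can be made concrete with a small calculation. The following is a minimal sketch, assuming ASR is counted per harmful question and that a question counts as jailbroken when at least one template elicits a harmful response. This excerpt does not spell out the exact definition, and the function name, record layout, and sample data are hypothetical.

```python
def attack_success_rate(outcomes):
    """Return the fraction of questions with at least one successful attempt.

    `outcomes` is an iterable of (question_id, template_id, succeeded) tuples.
    This record layout is an assumption made for illustration only.
    """
    seen = set()        # every question that was attempted
    jailbroken = set()  # questions with at least one successful attempt
    for question_id, _template_id, succeeded in outcomes:
        seen.add(question_id)
        if succeeded:
            jailbroken.add(question_id)
    if not seen:
        raise ValueError("no outcomes recorded")
    return len(jailbroken) / len(seen)


# Hypothetical evaluation log: four questions, two templates.
outcomes = [
    ("q1", "t1", False),
    ("q1", "t2", True),
    ("q2", "t1", True),
    ("q3", "t1", False),
    ("q3", "t2", False),
    ("q4", "t2", True),
]

asr = attack_success_rate(outcomes)
print(f"ASR = {asr:.0%}")  # 3 of 4 questions succeeded, so this prints "ASR = 75%"
```

Under this per-question reading, a reported ASR of ≥95% would mean that at most 5 in every 100 questions in the dataset resisted every template tried.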

Code & Models

Repositories

amazon-science/TurboFuzzLLM
Official

Videos

TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice · Underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection

Methods: Absolute Position Encodings · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer