LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs
Piyush Jha, Arnav Arora, Vijay Ganesh

TL;DR
LLMStinger uses reinforcement learning to fine-tune attacker LLMs, automatically generating adversarial suffixes that significantly improve jailbreak attack success rates across multiple large language models.
Contribution
This paper introduces a reinforcement learning-based method for automatically generating adversarial prompts, outperforming existing red-teaming techniques in jailbreak success rates.
Findings
+57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat over existing red-teaming methods
+50.3% improvement in ASR on Claude 2
94.97% ASR on GPT-3.5 and 99.4% ASR on Gemma-2B-it
Abstract
We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLMStinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM, generating new suffixes based on existing attacks for harmful questions from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and 99.4% on Gemma-2B-it, demonstrating the robustness and adaptability of LLMStinger across open and closed-source models.
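The abstract reports results as Attack Success Rate (ASR) and as improvements in ASR. As a minimal sketch of how such a metric is typically computed, the snippet below treats ASR as the percentage of attempts that a judge marks as successful. The function names and the boolean judgment list are hypothetical, and reading the "+" figures as absolute percentage-point differences is an assumption, since the abstract does not say whether the gains are absolute or relative.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return ASR as a percentage: the share of attempts judged successful.

    `judgments` is a hypothetical list of per-question verdicts
    (True = the judge marked the attempt as successful).
    """
    if not judgments:
        raise ValueError("judgments must be non-empty")
    return 100.0 * sum(judgments) / len(judgments)


def absolute_gain(asr_new: float, asr_baseline: float) -> float:
    """Difference in percentage points (assumed reading of a '+X%' gain)."""
    return asr_new - asr_baseline


# Toy data only; these are not results from the paper.
baseline = attack_success_rate([True, False, False, False])
improved = attack_success_rate([True, True, True, False])
gain = absolute_gain(improved, baseline)
```

Under this reading, a method that raises ASR from 25% to 75% would be reported as a +50% improvement; a relative reading of the same numbers would instead give +200%, so the convention matters when comparing against other papers.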
Taxonomy
Topics: Artificial Intelligence in Law
Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Layer Normalization · Adam · Attention Dropout · Multi-Head Attention
