LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs

Piyush Jha; Arnav Arora; Vijay Ganesh

arXiv:2411.08862 · cs.LG · January 29, 2026

TL;DR

LLMStinger uses reinforcement learning to fine-tune attacker LLMs, automatically generating adversarial suffixes that significantly improve jailbreak attack success rates across multiple large language models.

Contribution

This paper introduces a reinforcement learning-based method for automatically generating adversarial prompts, outperforming existing red-teaming techniques in jailbreak success rates.

Findings

01

Achieved a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat over existing red-teaming methods

02

Achieved a +50.3% increase in ASR on Claude 2

03

Reached 94.97% ASR on GPT-3.5 and 99.4% ASR on Gemma-2B-it

Abstract

We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLMStinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM, generating new suffixes based on existing attacks for harmful questions from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and 99.4% on Gemma-2B-it, demonstrating the robustness and adaptability of LLMStinger across open and closed-source models.
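The abstract reports results as Attack Success Rate (ASR) and as gains over baselines. As a minimal sketch of how such a metric is commonly computed, the snippet below treats ASR as the percentage of attempts that a judge marked successful. The function names `attack_success_rate` and `asr_gain` are hypothetical, and reading the reported "+57.2%" as a percentage-point difference is an assumption, not something the abstract states.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return ASR as a percentage: successful attempts / total attempts.

    `judgments` holds one boolean per attempt (True = judged successful).
    How the judge decides is outside this sketch.
    """
    if not judgments:
        raise ValueError("need at least one judgment to compute ASR")
    return 100.0 * sum(judgments) / len(judgments)


def asr_gain(method_asr: float, baseline_asr: float) -> float:
    """Percentage-point difference between two ASR values.

    Assumption: gains are absolute differences, not relative changes.
    """
    return method_asr - baseline_asr


# Toy data, not results from the paper.
baseline = attack_success_rate([True, False, False, False])
method = attack_success_rate([True, True, True, False])
gain = asr_gain(method, baseline)
```

With the toy data above, `baseline` is 25.0, `method` is 75.0, and `gain` is 50.0 percentage points. A relative-change definition would give a different number for the same inputs, which is why the convention should be checked against the paper's tables before comparing figures.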

Taxonomy

Topics: Artificial Intelligence in Law

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Layer Normalization · Adam · Attention Dropout · Multi-Head Attention