AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Aashray Reddy; Andrew Zagula; Nicholas Saban

arXiv:2511.02376 · cs.CL · December 23, 2025

Open Access

TL;DR

AutoAdv is a training-free, adaptive framework for automated multi-turn jailbreaking of large language models. It reaches an attack success rate of up to 95% on Llama-3.1-8B within six turns and exposes vulnerabilities in current safety measures across commercial and open-source models.

Contribution

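The paper presents AutoAdv, a multi-turn attack framework that combines a pattern manager, a temperature manager, and a two-phase rewriting strategy. It reports a 24% improvement over single-turn baselines and reveals persistent safety vulnerabilities in LLMs.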

Findings

01

AutoAdv achieves an attack success rate of up to 95% on Llama-3.1-8B within six turns.

02

Multi-turn attacks outperform single-turn approaches.

03

Current safety mechanisms remain vulnerable to multi-turn jailbreaks.

Abstract

Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs. Yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves an attack success rate of up to 95% on Llama-3.1-8B within six turns, a 24% improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests and then iteratively refines them. Extensive evaluation across commercial and open-source models (Llama-3.1-8B, GPT-4o mini, Qwen3-235B,…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection