AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Aashray Reddy, Andrew Zagula, Nicholas Saban

TL;DR
AutoAdv introduces a training-free, adaptive framework for multi-turn jailbreaking of large language models, achieving high success rates and exposing vulnerabilities in current safety measures across various models.
Contribution
The paper presents AutoAdv, a novel multi-turn attack framework that outperforms single-turn methods and reveals persistent safety vulnerabilities in LLMs.
Findings
AutoAdv achieves up to 95% success rate on Llama-3.1-8B.
Multi-turn attacks outperform single-turn approaches.
Current safety mechanisms are ineffective against multi-turn jailbreaks.
Abstract
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs. Yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves an attack success rate of up to 95% on Llama-3.1-8B within six turns, a 24% improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests and then iteratively refines them. Extensive evaluation across commercial and open-source models (Llama-3.1-8B, GPT-4o mini, Qwen3-235B,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
