AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Aashray Reddy; Andrew Zagula; Nicholas Saban

arXiv:2507.01020 · cs.CR · December 25, 2025

TL;DR

AutoAdv introduces an automated, multi-turn adversarial prompting framework that systematically uncovers vulnerabilities in large language models' safety mechanisms, revealing high success rates of jailbreak attacks across popular models.

Contribution

The paper presents a novel automated multi-turn attack methodology that iteratively refines prompts to expose safety vulnerabilities in LLMs, advancing adversarial testing techniques.

Findings

01. Achieved a jailbreak success rate of up to 86%.
02. Revealed significant safety vulnerabilities in state-of-the-art LLMs.
03. Demonstrated the effectiveness of automated multi-turn prompting in adversarial evaluation.

Abstract

Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks: carefully crafted malicious inputs intended to circumvent safety guardrails and elicit harmful responses. We present AutoAdv, a novel framework that automates adversarial prompt generation to systematically evaluate and expose vulnerabilities in LLM safety mechanisms. Our approach leverages a parametric attacker LLM to produce semantically disguised malicious prompts through strategic rewriting techniques, specialized system prompts, and optimized hyperparameter configurations. The primary contribution of our work is a dynamic, multi-turn attack methodology that analyzes failed jailbreak attempts and iteratively generates refined follow-up prompts, leveraging techniques such as roleplaying, misdirection, and contextual manipulation. We quantitatively evaluate attack success rate (ASR)…
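The abstract is cut off before it describes the evaluation protocol, so the following is only a minimal sketch of how an attack success rate under a turn budget might be tallied from already-judged outcomes. It assumes ASR means the fraction of evaluated prompts judged successful within a given number of turns. The `AttemptRecord` type, the `success_turn` field, and the `attack_success_rate` function are hypothetical names introduced for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AttemptRecord:
    """Judged outcome of one evaluated conversation (hypothetical record type)."""

    prompt_id: str
    # 1-based turn at which a judge marked the attempt successful,
    # or None if it never succeeded within the evaluated conversation.
    success_turn: Optional[int]


def attack_success_rate(records: Sequence[AttemptRecord], max_turns: int) -> float:
    """Return the fraction of records that succeeded within `max_turns` turns.

    Assumption: ASR = successes within the turn budget / total evaluated prompts.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    if not records:
        raise ValueError("records must not be empty")

    successes = sum(
        1
        for record in records
        if record.success_turn is not None and record.success_turn <= max_turns
    )
    return successes / len(records)
```

Calling the function with increasing `max_turns` on the same records gives a non-decreasing series, which is one way to compare single-turn and multi-turn results in a safety evaluation report.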

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling