AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Aashray Reddy, Andrew Zagula, Nicholas Saban

TL;DR
AutoAdv introduces an automated, multi-turn adversarial prompting framework that systematically uncovers vulnerabilities in large language models' safety mechanisms, revealing high success rates of jailbreak attacks across popular models.
Contribution
The paper presents a novel automated multi-turn attack methodology that iteratively refines prompts to expose safety vulnerabilities in LLMs, advancing adversarial testing techniques.
Findings
Achieved a jailbreak attack success rate (ASR) of up to 86%.
Revealed significant safety vulnerabilities in state-of-the-art LLMs.
Demonstrated effectiveness of automated multi-turn prompting in adversarial evaluation.
Abstract
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks: carefully crafted malicious inputs intended to circumvent safety guardrails and elicit harmful responses. As such, we present AutoAdv, a novel framework that automates adversarial prompt generation to systematically evaluate and expose vulnerabilities in LLM safety mechanisms. Our approach leverages a parametric attacker LLM to produce semantically disguised malicious prompts through strategic rewriting techniques, specialized system prompts, and optimized hyperparameter configurations. The primary contribution of our work is a dynamic, multi-turn attack methodology that analyzes failed jailbreak attempts and iteratively generates refined follow-up prompts, leveraging techniques such as roleplaying, misdirection, and contextual manipulation. We quantitatively evaluate attack success rate (ASR)…
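The abstract reports attack success rate (ASR) as its evaluation metric. The excerpt is cut off before it gives the exact definition, so the following is only a minimal sketch that assumes the common reading: the fraction of evaluated requests for which at least one attempt was judged successful, optionally limited to the first few turns. The `AttackRecord` type, its fields, and the `max_turns` parameter are hypothetical names chosen for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AttackRecord:
    """Outcome of one evaluated request (hypothetical structure, not from the paper)."""
    request_id: str
    # 1-based turn at which a judge marked the attempt successful; None if it never succeeded.
    success_turn: Optional[int]


def attack_success_rate(records: Sequence[AttackRecord],
                        max_turns: Optional[int] = None) -> float:
    """Return the fraction of records that succeeded within max_turns turns.

    With max_turns=None, any success counts regardless of the turn.
    """
    if not records:
        raise ValueError("no records to evaluate")
    successes = sum(
        1
        for r in records
        if r.success_turn is not None
        and (max_turns is None or r.success_turn <= max_turns)
    )
    return successes / len(records)


# Toy data for illustration only; these are not results from the paper.
records = [
    AttackRecord("q1", success_turn=1),
    AttackRecord("q2", success_turn=3),
    AttackRecord("q3", success_turn=None),
    AttackRecord("q4", success_turn=2),
]
single_turn_asr = attack_success_rate(records, max_turns=1)
overall_asr = attack_success_rate(records)
```

Comparing the value at `max_turns=1` with the unrestricted value is one way to see how much a multi-turn evaluation adds over a single-turn one, under the assumed definition.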
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Topic Modeling
