Automating Deception: Scalable Multi-Turn LLM Jailbreaks
Adarsh Kumarappan, Ananya Mujoo

TL;DR
This paper presents an automated method to generate large-scale multi-turn jailbreak datasets exploiting psychological principles, revealing significant vulnerabilities in most LLMs and highlighting the need for improved safety defenses.
Contribution
It introduces a scalable, automated pipeline for creating psychologically-grounded multi-turn jailbreak datasets and systematically evaluates model robustness against these attacks.
Findings
GPT models are highly vulnerable to multi-turn attacks, with attack success rates rising by up to 32% when conversation history is included.
Google's Gemini 2.5 Flash shows near immunity to multi-turn jailbreaks.
Model robustness varies significantly across different LLM families.
Abstract
Multi-turn conversational attacks pose a persistent threat to Large Language Models (LLMs). They bypass safety alignment by leveraging psychological principles such as Foot-in-the-Door (FITD), in which a small initial request paves the way for a more significant one. Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT…
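The comparison between the two evaluation conditions comes down to simple arithmetic on judged outcomes. The sketch below is a minimal illustration, not the authors' evaluation code: the record layout, the model names, the condition labels, and both function names are hypothetical, and the sample values are placeholders rather than results from the paper. It also assumes the effect of history is reported as a difference in percentage points, which the summary above does not state explicitly.

```python
from collections import defaultdict

# Hypothetical judged outcomes: (model, condition, attack_succeeded).
# Names and values are placeholders, not data from the paper.
records = [
    ("model-a", "multi_turn", True),
    ("model-a", "multi_turn", True),
    ("model-a", "multi_turn", False),
    ("model-a", "multi_turn", False),
    ("model-a", "single_turn", True),
    ("model-a", "single_turn", False),
    ("model-a", "single_turn", False),
    ("model-a", "single_turn", False),
    ("model-b", "multi_turn", True),
    ("model-b", "multi_turn", False),
    ("model-b", "multi_turn", False),
    ("model-b", "multi_turn", False),
    ("model-b", "single_turn", True),
    ("model-b", "single_turn", False),
    ("model-b", "single_turn", False),
    ("model-b", "single_turn", False),
]


def attack_success_rate(records):
    """Return {(model, condition): fraction of attempts judged successful}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for model, condition, succeeded in records:
        totals[(model, condition)] += 1
        successes[(model, condition)] += int(succeeded)
    return {key: successes[key] / totals[key] for key in totals}


def history_effect(records):
    """Return {model: multi-turn ASR minus single-turn ASR, in percentage points}.

    Assumes every model has at least one record under both conditions;
    a missing condition raises KeyError.
    """
    asr = attack_success_rate(records)
    models = sorted({model for model, _ in asr})
    return {
        m: 100 * (asr[(m, "multi_turn")] - asr[(m, "single_turn")])
        for m in models
    }


if __name__ == "__main__":
    for model, delta in history_effect(records).items():
        print(f"{model}: {delta:+.1f} percentage points with history")
```

With these placeholder records, model-a succeeds on 2 of 4 multi-turn attempts and 1 of 4 single-turn attempts, so its history effect is 25 percentage points, while model-b is unchanged across conditions. A positive value indicates that conversation history makes a model easier to jailbreak; a value near zero indicates that history has little effect on the outcome.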
Taxonomy
Topics: Deception detection and forensic psychology · Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning
