Automating Deception: Scalable Multi-Turn LLM Jailbreaks

Adarsh Kumarappan; Ananya Mujoo

arXiv:2511.19517·cs.LG·March 10, 2026

Open Access

TL;DR

This paper presents an automated method to generate large-scale multi-turn jailbreak datasets exploiting psychological principles, revealing significant vulnerabilities in most LLMs and highlighting the need for improved safety defenses.

Contribution

It introduces a scalable, automated pipeline for creating psychologically-grounded multi-turn jailbreak datasets and systematically evaluates model robustness against these attacks.

Findings

01. GPT models are highly vulnerable to multi-turn attacks, with attack success rates rising by up to 32%.

02. Google's Gemini 2.5 Flash shows near-immunity to multi-turn jailbreaks.

03. Model robustness varies significantly across LLM families.
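
The success-rate comparison behind these findings can be illustrated with a minimal sketch. It assumes each trial is labeled with its condition (multi-turn with history, or single-turn without) and with whether the attack succeeded. The `Trial` type, the function names, and the toy labels are hypothetical and do not come from the paper. The listing also does not say whether the 32% figure is a relative increase or a difference in percentage points; the sketch reports percentage points as an assumption.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    scenario_id: int
    with_history: bool  # True = multi-turn, False = single-turn
    complied: bool      # True if the attack was judged successful


def attack_success_rate(trials, with_history):
    """Fraction of trials in one condition where the attack succeeded."""
    subset = [t for t in trials if t.with_history == with_history]
    if not subset:
        raise ValueError("no trials for this condition")
    return sum(t.complied for t in subset) / len(subset)


def history_effect(trials):
    """Multi-turn ASR minus single-turn ASR, in percentage points."""
    multi = attack_success_rate(trials, with_history=True)
    single = attack_success_rate(trials, with_history=False)
    return 100.0 * (multi - single)


# Toy placeholder labels for illustration only, not results from the paper.
toy_trials = [
    Trial(0, True, True), Trial(0, False, False),
    Trial(1, True, True), Trial(1, False, True),
    Trial(2, True, False), Trial(2, False, False),
    Trial(3, True, True), Trial(3, False, False),
]
print(history_effect(toy_trials))
```

Pairing each scenario across both conditions, as the toy data does, keeps the comparison about conversational history rather than about differences between scenarios.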

Abstract

Multi-turn conversational attacks pose a persistent threat to Large Language Models (LLMs). These attacks bypass safety alignment by leveraging psychological principles such as Foot-in-the-Door (FITD), in which a small initial request paves the way for a more significant one. Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT…

Taxonomy

Topics: Deception detection and forensic psychology · Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning