Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors
Yi Zhao, Youzhi Zhang

TL;DR
Siren is a learning-based framework that simulates real-world multi-turn jailbreak attacks on large language models, outperforming existing single-turn methods and providing insights for developing stronger defenses.
Contribution
The paper introduces Siren, a multi-turn attack framework that learns its attack strategy in order to better mimic how human adversaries jailbreak LLMs, going beyond prior single-turn attacks and multi-turn methods built on static patterns.
Findings
Achieves a 90% attack success rate against Gemini-1.5-Pro
Attains a 70% attack success rate against GPT-4o with a Mistral-7B attacker
Performs comparably to a multi-turn attack that uses GPT-4o as the attacker, while requiring fewer turns
Abstract
Large language models (LLMs) are widely used in real-world applications, raising concerns about their safety and trustworthiness. While red-teaming with jailbreak prompts exposes the vulnerabilities of LLMs, current efforts focus primarily on single-turn attacks, overlooking the multi-turn strategies used by real-world adversaries. Existing multi-turn methods rely on static patterns or predefined logical chains, failing to account for the dynamic strategies during attacks. We propose Siren, a learning-based multi-turn attack framework designed to simulate real-world human jailbreak behaviors. Siren consists of three stages: (1) MiniMax-driven training set construction utilizing Turn-Level LLM feedback, (2) post-training attackers with supervised fine-tuning (SFT) and direct preference optimization (DPO), and (3) interactions between the attacking and target LLMs. Experiments demonstrate…
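Stage (2) of the abstract names direct preference optimization (DPO) without stating the objective. As a minimal sketch, the standard DPO loss for a single preference pair can be written as below. This is the generic published form of DPO, not necessarily the exact configuration used for Siren. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions, and how chosen and rejected attacker outputs are paired is not specified in the text above.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under a frozen reference model. beta is a hypothetical default, not a
    value taken from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference model agree on both responses, the margin is zero and the loss equals log 2. The loss falls as the policy shifts probability toward the chosen response relative to the reference.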
Taxonomy
Topics: Crime Patterns and Interventions · Digital and Cyber Forensics · Advanced Malware Detection Techniques
Methods: Sinusoidal Representation Network · Sparse Evolutionary Training · Focus
