SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng; Xiaodong Liu; Weiwei Yang; Jialin Song; Xuekai Zhu; Chenliang Xu; and Jianfeng Gao

arXiv:2602.06854·cs.CL·February 9, 2026

Open Access · 3 Reviews

TL;DR

SEMA introduces a simple, effective multi-turn attack framework for safety-aligned chatbots, significantly improving attack success rates without relying on external data or complex strategies, thus providing a robust stress test for LLM safety.

Contribution

The paper presents SEMA, a novel multi-turn jailbreak attack method that stabilizes training through self-generated prompts and achieves state-of-the-art success rates without external data.

Findings

01

SEMA achieves an average 80.1% ASR@1 on AdvBench, outperforming baselines by 33.9%.

02

The method is compact, reproducible, and transferable across different models.

03

SEMA provides a stronger, more realistic stress test for LLM safety.
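
The ASR@1 figure above can be read as an attack success rate measured with a single attempt per harmful behavior. The following is a minimal sketch of that metric under an assumed, generic definition (a behavior counts as a success if any of its first k attempts is judged successful). The function name `asr_at_k` and the boolean outcome format are hypothetical and not taken from the paper, which may define and judge success differently.

```python
from typing import Sequence


def asr_at_k(outcomes: Sequence[Sequence[bool]], k: int = 1) -> float:
    """Fraction of behaviors with at least one success in the first k attempts.

    outcomes[i][j] is True if attempt j on behavior i was judged successful.
    This is an assumed, generic definition of ASR@k, not the paper's code.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    # Count a behavior once if any of its first k attempts succeeded.
    hits = sum(1 for attempts in outcomes if any(attempts[:k]))
    return hits / len(outcomes)


# Toy data: four behaviors, two attempts each (made-up values).
toy_outcomes = [
    [True, False],
    [False, True],
    [False, False],
    [True, True],
]
asr_1 = asr_at_k(toy_outcomes, k=1)
asr_2 = asr_at_k(toy_outcomes, k=2)
```

With the toy data, two of the four behaviors succeed on the first attempt and three succeed within two attempts, so the metric can only stay flat or rise as k increases.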

Abstract

Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and…
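
One of the reviews below describes the reinforcement learning stage as GRPO-based. As a rough illustration of that family of methods, the sketch below computes group-relative advantages: each rollout's scalar reward is normalized against the mean and standard deviation of the rewards sampled for the same prompt. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration. The reward values are placeholders, and the paper's actual reward formula is not reproduced here because the abstract is truncated.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts sampled for the same prompt.

    Generic GRPO-style normalization: (reward - group mean) / (group std + eps).
    Illustrative only; not the authors' implementation.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    # eps keeps the division defined when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder rewards for three rollouts of one prompt (made-up values).
advantages = group_relative_advantages([0.0, 0.5, 1.0])
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to form the baseline.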

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. This paper is clearly written and provides a well-defined explanation of the proposed approach.
2. The proposed intent-drift-aware reward and GRPO-based jailbreaking method is simple yet effective and novel.
3. The experiments are comprehensive; the comparisons across various baselines and models sufficiently demonstrate the superiority of the proposed methodology.

Weaknesses

Major
1. The open-loop assumption side-steps the real feedback dynamics where victim replies steer the attacker (including deflections). While this is computationally attractive, it may overestimate transferability to real attackers who adapt turn-by-turn. A head-to-head closed-loop variant of SEMA (same reward and intent anchor, but conditioned on last victim response) would clarify the realism/efficiency trade-off.
2. The intent-drift-aware reward is central but depends on an evaluation model

Reviewer 02 · Rating 2 · Confidence 5

Strengths

- This work proposes a decent multi-turn jailbreak framework that achieves higher ASR compared to reported single-turn and multi-turn jailbreak methods.
- The evaluation is relatively thorough, testing many open and closed models across two solid benchmarks.
- Results show that SEMA achieves higher ASR than the single-turn and multi-turn methods it is compared against.
- The visual presentations of this paper are effective for conveying the mechanism of the framework as well as delivering core takeaways

Weaknesses

- The paper’s scope is limited by its exclusive focus on developing attackers without accompanying defensive methods. While SEMA advances the study of multi-turn jailbreaks, it offers no systematic exploration of countermeasures or co-evolving defenses. As a result, the work demonstrates how to break safety mechanisms effectively but provides little insight into how to strengthen or adapt them, narrowing its overall contribution to LLM safety research.
- This work claims to achieve SOTA attack

Reviewer 03 · Rating 6 · Confidence 3

Strengths

Strong transferability: The method demonstrates high transfer rates across different victim models, suggesting the learned attacks capture generalizable vulnerabilities rather than model-specific artifacts.
Simplified threat model: The open-loop generation approach reduces computational requirements by avoiding the need for iterative victim interaction during attack generation. This also removes dependencies on predefined strategy templates or branching assumptions that constrain template-drive

Weaknesses

*Missing cost analysis*: Despite frequent mentions of reduced cost as a key advantage, the paper lacks quantitative analysis of computational requirements. Specifically:
1. How many prompts need to be generated on average during training and inference?
2. What are the API costs for the evaluation model (GPT-4.1-mini) during training?
3. How does the total cost compare to baseline methods like Crescendo or GOAT?
4. What is the cost breakdown between prefilling self-tuning and RL stages?
*Incom

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Spam and Phishing Detection · Advanced Malware Detection Techniques