SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues

Martin Kuo; Jianyi Zhang; Aolin Ding; Louis DiValentin; Amin Hass; Benjamin F Morris; Isaac Jacobson; Randolph Linderman; James Kiessling; Nicolas Ramos; Bhavna Gopal; Maziyar Baran Pouyan; Changwei Liu; Hai Li; Yiran Chen

arXiv:2506.00668·cs.CL·June 3, 2025

TL;DR

This paper introduces STREAM, a novel safety alignment method that detects malicious multi-turn dialogues to protect large language models from attacks while maintaining their functionality.

Contribution

The paper presents a new safety reasoning moderator trained on a human-annotated dataset to effectively identify malicious intent in multi-turn conversations.

Findings

01 Reduces attack success rate by 51.2%

02 Outperforms existing defense techniques

03 Maintains LLM capabilities

Abstract

Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues to achieve harmful objectives, posing significant safety risks to society. To address this challenge, we propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM). STREAM defends LLMs against multi-turn attacks while preserving their functional capabilities. Our approach involves constructing a human-annotated dataset, the Safety Reasoning Multi-turn Dialogues dataset, which is used to fine-tune a plug-and-play safety reasoning moderator. This model is designed to identify malicious intent hidden within multi-turn conversations and alert the target LLM of potential risks. We evaluate STREAM across multiple LLMs against prevalent multi-turn attack strategies. Experimental results demonstrate that our method significantly outperforms…
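The plug-and-play arrangement described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Turn`, `Verdict`, `toy_moderator`, `toy_target`, and `guarded_reply` are hypothetical, the keyword check is a stand-in for the fine-tuned safety reasoning moderator, and passing the alert to the target model as a prepended system turn is an assumption about how the warning might be delivered.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str      # "system", "user", or "assistant"
    content: str


@dataclass
class Verdict:
    malicious: bool
    reasoning: str


Moderator = Callable[[List[Turn]], Verdict]
TargetLLM = Callable[[List[Turn]], str]


def toy_moderator(history: List[Turn]) -> Verdict:
    """Stand-in for a fine-tuned moderator (assumption, not the paper's model).

    It reads ALL user turns together, so intent that is split across
    turns can be flagged even when no single turn looks harmful.
    """
    user_text = " ".join(t.content.lower() for t in history if t.role == "user")
    risky_terms = ("chemicals", "detonate")  # hypothetical toy rule
    if all(term in user_text for term in risky_terms):
        return Verdict(True, "User turns jointly suggest building an explosive.")
    return Verdict(False, "No harmful intent found across the dialogue.")


def toy_target(history: List[Turn]) -> str:
    """Stand-in target LLM: refuses only when it has been alerted."""
    alerted = any(
        t.role == "system" and t.content.startswith("Safety notice")
        for t in history
    )
    return "refuse" if alerted else "answer"


def guarded_reply(history: List[Turn], moderator: Moderator, target: TargetLLM) -> str:
    """Run the moderator first; if it flags the dialogue, alert the target."""
    verdict = moderator(history)
    if verdict.malicious:
        notice = Turn("system", f"Safety notice: {verdict.reasoning}")
        return target([notice] + history)
    # Benign dialogues reach the target unchanged, preserving its behavior.
    return target(history)
```

Because the moderator only adds a notice and never rewrites the user turns, the target model sees benign conversations exactly as it would without the moderator, which is one way to read the abstract's claim that functional capabilities are preserved.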

Code & Models

Datasets

DukeCEICenter/Safety_Reasoning_Multi_Turn_Dialogue
dataset · 32 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems