MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming

Weiyang Guo; Jing Li; Wenya Wang; YU LI; Daojing He; Jun Yu; Min Zhang

arXiv:2505.17147 · cs.CR · May 26, 2025

TL;DR

This paper introduces MTSA, a multi-turn safety alignment framework for LLMs that uses multi-round red-teaming and reinforcement learning to improve robustness against jailbreak attacks in multi-turn dialogues.

Contribution

The paper presents a novel multi-turn safety alignment framework with thought-guided attack learning and iterative optimization, enhancing LLM safety in complex multi-round interactions.

Findings

01 The red-team model achieves state-of-the-art attack success rates.

02 The target model significantly improves its performance on safety benchmarks.

03 Multi-turn reinforcement learning enhances robustness against jailbreak attacks.

Abstract

The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden across interactions, making LLMs more prone to producing harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework to address the challenge of securing LLMs in multi-round interactions. It consists of two stages. In the thought-guided attack learning stage, the red-team model learns thought-guided multi-round jailbreak attacks in order to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities through interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards…
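The abstract is cut off before it describes the future-reward algorithm, so the details of the paper's method are not available here. As a rough illustration of the general idea only, the sketch below computes a standard discounted reward-to-go for each turn of a dialogue, so that an early turn is credited with what happens in later turns. The function name `future_rewards`, the discount factor `gamma`, and the per-turn reward values are all assumptions made for this example; this is generic reinforcement-learning bookkeeping, not the authors' algorithm.

```python
from typing import List


def future_rewards(turn_rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Discounted reward-to-go for each dialogue turn.

    Hypothetical helper for illustration; not taken from the MTSA paper.
    returns[t] = turn_rewards[t] + gamma * returns[t + 1]
    """
    returns = [0.0] * len(turn_rewards)
    running = 0.0
    # Walk backwards so each turn accumulates the discounted rewards after it.
    for t in reversed(range(len(turn_rewards))):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Assumed toy dialogue: only the final turn receives a safety reward.
    print(future_rewards([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

In this toy setting, the first two turns receive no immediate reward but still get a nonzero value because of the outcome at the final turn, which is the intuition behind scoring a turn by its downstream consequences.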

Code & Models

Repositories

yuki-younai/mtsa (PyTorch, official)

Videos

MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming · Underline