TL;DR
This paper introduces MTSA, a multi-turn safety alignment framework for LLMs that uses multi-round red-teaming and reinforcement learning to improve robustness against jailbreak attacks in multi-turn dialogues.
Contribution
The paper presents a novel multi-turn safety alignment framework with thought-guided attack learning and iterative optimization, enhancing LLM safety in complex multi-round interactions.
Findings
The red-team model achieves state-of-the-art attack success rates.
The target model's performance on safety benchmarks improves significantly.
Multi-turn reinforcement learning improves robustness against jailbreaks.
Abstract
The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden across interactions, making LLMs more prone to producing harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework to address the challenge of securing LLMs in multi-round interactions. It consists of two stages. In the thought-guided attack learning stage, the red-team model learns thought-guided multi-round jailbreak attacks in order to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities through interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards…
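The abstract is cut off before it describes the reinforcement learning algorithm, so the details of the paper's method are not available here. As a generic illustration only, one common way to let a reward at a later turn influence earlier turns is the discounted return. The sketch below is not the paper's algorithm: the function name `discounted_future_returns`, the per-turn reward list, and the discount factor `gamma` are all assumptions made for illustration.

```python
from typing import List


def discounted_future_returns(turn_rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Return, for each dialogue turn, its reward plus discounted rewards of later turns.

    Hypothetical helper for illustration; not taken from the MTSA paper.
    """
    returns = [0.0] * len(turn_rewards)
    running = 0.0
    # Walk backwards so each turn can reuse the return of the turn after it.
    for t in reversed(range(len(turn_rewards))):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    return returns


# A three-turn dialogue where only the final turn receives a reward.
print(discounted_future_returns([0.0, 0.0, 1.0], gamma=0.5))
```

With this kind of credit assignment, early turns in a dialogue receive a nonzero training signal even when the outcome (for example, a safe or unsafe final response) is only scored at the end.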
