One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Xinjie Shen; Rongzhe Wei; Peizhi Niu; Haoyu Wang; Ruihan Wu; Eli Chien; Bo Li; Pin-Yu Chen; Pan Li

arXiv:2605.05630·cs.CL·May 13, 2026

TL;DR

This paper introduces a method for detecting the earliest turn in a multi-turn dialogue at which delivering the response would enable harmful action, supported by a new dataset and a turn-level monitor for large language models.

Contribution

The authors propose TurnGate, a turn-level monitor for harmful intent detection, and construct the Multi-Turn Intent Dataset (MTID) to support its training and evaluation against malicious intent distributed across multiple turns.

Findings

01. TurnGate outperforms existing baselines in harmful intent detection.

02. MTID enables effective training and evaluation of turn-level safety monitors.

03. TurnGate generalizes well across different domains and attacker pipelines.

Abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset…
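
The detection objective described in the abstract can be illustrated with a minimal sketch. This is not the TurnGate implementation: the function `earliest_closure_turn`, the `score` callback, the threshold, and the toy scorer are all hypothetical names introduced here for illustration. The sketch only shows the shape of a response-aware check, in which the monitor sees the candidate response before it is delivered and reports the first turn where the accumulated interaction crosses a risk threshold.

```python
from typing import Callable, List, Optional, Tuple

# One dialogue turn: (user_message, candidate_response).
Turn = Tuple[str, str]
# Hypothetical scorer: maps (delivered history, user message, candidate
# response) to a risk value in [0, 1]. A real monitor would be a trained model.
Scorer = Callable[[List[Turn], str, str], float]


def earliest_closure_turn(
    turns: List[Turn], score: Scorer, threshold: float = 0.5
) -> Optional[int]:
    """Return the index of the first turn whose candidate response, if
    delivered, would push the accumulated risk to the threshold or above.
    Return None if no turn does (the conversation is never blocked)."""
    history: List[Turn] = []
    for i, (user_msg, candidate) in enumerate(turns):
        # Response-aware: the candidate response is scored before delivery.
        if score(history, user_msg, candidate) >= threshold:
            return i
        # Below the threshold, so the response is delivered and joins history.
        history.append((user_msg, candidate))
    return None


def toy_score(history: List[Turn], user_msg: str, candidate: str) -> float:
    """Illustrative stand-in scorer (an assumption, not from the paper):
    the fraction of three placeholder 'pieces' present across all delivered
    responses plus the candidate response."""
    pieces = ("step A", "step B", "step C")
    text = " ".join(resp for _, resp in history) + " " + candidate
    return sum(piece in text for piece in pieces) / len(pieces)


dialogue = [
    ("q1", "details on step A"),
    ("q2", "details on step B"),
    ("q3", "details on step C"),
    ("q4", "unrelated follow-up"),
]
blocked_at = earliest_closure_turn(dialogue, toy_score, threshold=1.0)
```

In this toy setup, no single response is sufficient on its own; only the third response completes the set, so the check fires at that turn rather than earlier, which mirrors the stated goal of avoiding premature refusal of benign exploratory turns.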

Code & Models

Repositories

Graph-COM/TurnGate
github

Models

Graph-COM/TurnGate-0.1
model · ♡ 1

Datasets

Graph-COM/MTID
dataset · 139 dl
