State-Dependent Safety Failures in Multi-Turn Language Model Interaction

Pengcheng Li; Jie Zhang; Tianwei Zhang; Han Qiu; Zhang Kejun; Weiming Zhang; Nenghai Yu; Wenbo Zhou

arXiv:2603.15684 · cs.CR · March 18, 2026

Open Access

TL;DR

This paper argues that many safety failures in multi-turn language model interaction arise from the evolution of contextual state over a dialogue rather than from isolated prompt vulnerabilities, and it introduces STAR, a state-oriented diagnostic framework for analyzing them.

Contribution

The paper introduces STAR, a state-oriented diagnostic framework for analyzing safety behavior over conversational trajectories in language models.

Findings

01. Safety failures often result from structured state evolution.

02. Models can appear safe in static tests but fail in multi-turn interactions.

03. Safety boundaries can be crossed rapidly due to state drift.

Abstract

Safety alignment in large language models is typically evaluated under isolated queries, yet real-world use is inherently multi-turn. Although multi-turn jailbreaks are empirically effective, the structure of conversational safety failure remains insufficiently understood. In this work, we study safety failures from a state-space perspective and show that many multi-turn failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities. We introduce STAR, a state-oriented diagnostic framework that treats dialogue history as a state transition operator and enables controlled analysis of safety behavior along interaction trajectories. Rather than optimizing attack strength, STAR provides a principled probe of how aligned models traverse the safety boundary under autoregressive conditioning. Across multiple frontier language models, we find that systems…
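
The state-space view described in the abstract can be illustrated with a toy sketch. This is not STAR and is not taken from the paper: the `DialogueState` class, the scalar `drift`, the per-turn shift values, and the boundary of 1.0 are all hypothetical, chosen only to show how turns that each look safe in isolation might accumulate into a boundary crossing when dialogue history acts as a state transition.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class DialogueState:
    """Hypothetical toy state: the turns so far plus a scalar drift value."""
    history: Tuple[str, ...]
    drift: float


def step(state: DialogueState, turn: str,
         shift: Callable[[str], float]) -> DialogueState:
    """Apply one turn as a state transition: extend history, accumulate drift."""
    return DialogueState(state.history + (turn,), state.drift + shift(turn))


def trajectory(turns: Sequence[str],
               shift: Callable[[str], float]) -> List[DialogueState]:
    """Return the initial state followed by the state after each turn."""
    state = DialogueState((), 0.0)
    states = [state]
    for turn in turns:
        state = step(state, turn, shift)
        states.append(state)
    return states


def first_crossing(states: Sequence[DialogueState],
                   boundary: float) -> Optional[int]:
    """Number of turns after which drift first exceeds the boundary, else None."""
    for i, s in enumerate(states):
        if s.drift > boundary:
            return i
    return None


# Assumed toy values: each turn contributes a shift below the boundary.
TOY_SHIFT = {"turn A": 0.25, "turn B": 0.5, "turn C": 0.5}
BOUNDARY = 1.0  # assumed toy safety boundary


def toy_shift(turn: str) -> float:
    return TOY_SHIFT[turn]


turns = ["turn A", "turn B", "turn C"]

# Static, single-turn evaluation: no turn crosses the boundary on its own.
static_results = [first_crossing(trajectory([t], toy_shift), BOUNDARY)
                  for t in turns]

# Multi-turn evaluation: the same turns, applied in sequence, do cross it.
multi_turn_result = first_crossing(trajectory(turns, toy_shift), BOUNDARY)

print(static_results, multi_turn_result)
```

In this toy setup the single-turn checks all report no crossing, while the three-turn trajectory crosses the assumed boundary after its third turn. A real diagnostic would replace the scalar drift with measurements of model behavior. The abstract does not specify how STAR does this, so nothing here should be read as its method.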

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems