State-Dependent Safety Failures in Multi-Turn Language Model Interaction
Pengcheng Li, Jie Zhang, Tianwei Zhang, Han Qiu, Zhang Kejun, Weiming Zhang, Nenghai Yu, Wenbo Zhou

TL;DR
This paper investigates safety failures in multi-turn language model interactions, revealing that safety issues often stem from state evolution over dialogue rather than isolated prompts, and introduces a diagnostic framework called STAR.
Contribution
The paper introduces STAR, a state-oriented diagnostic framework for analyzing safety behavior over conversational trajectories in language models.
Findings
Many multi-turn safety failures arise from structured evolution of the contextual state rather than from isolated prompt vulnerabilities.
Models can appear safe in static tests but fail in multi-turn interactions.
Safety boundaries can be crossed rapidly due to state drift.
Abstract
Safety alignment in large language models is typically evaluated under isolated queries, yet real-world use is inherently multi-turn. Although multi-turn jailbreaks are empirically effective, the structure of conversational safety failure remains insufficiently understood. In this work, we study safety failures from a state-space perspective and show that many multi-turn failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities. We introduce STAR, a state-oriented diagnostic framework that treats dialogue history as a state transition operator and enables controlled analysis of safety behavior along interaction trajectories. Rather than optimizing attack strength, STAR provides a principled probe of how aligned models traverse the safety boundary under autoregressive conditioning. Across multiple frontier language models, we find that systems…
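The state-space view described in the abstract can be illustrated with a toy sketch. This is not the STAR implementation: the names `DialogueState`, `transition`, `trajectory`, `first_crossing`, and `toy_score` are hypothetical, and the scorer is an arbitrary stand-in for a real safety measure. It only shows how a conversation can pass a per-prompt (static) check while its trajectory crosses a threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class DialogueState:
    """Hypothetical state: the full dialogue history so far."""
    history: Tuple[str, ...] = ()


def transition(state: DialogueState, turn: str) -> DialogueState:
    """Treat the history as the state; each new turn maps it to the next state."""
    return DialogueState(state.history + (turn,))


def trajectory(turns: List[str]) -> List[DialogueState]:
    """Return every state visited along a conversation, starting from the empty state."""
    states = [DialogueState()]
    for turn in turns:
        states.append(transition(states[-1], turn))
    return states


def first_crossing(
    turns: List[str],
    score: Callable[[DialogueState], float],
    threshold: float,
) -> Optional[int]:
    """Return the 1-based turn after which the score first exceeds the threshold, or None."""
    for i, state in enumerate(trajectory(turns)[1:], start=1):
        if score(state) > threshold:
            return i
    return None


def toy_score(state: DialogueState) -> float:
    """Assumed stand-in scorer: risk grows with accumulated context (not a real classifier)."""
    return 0.3 * len(state.history)


if __name__ == "__main__":
    turns = ["turn 1", "turn 2", "turn 3"]
    threshold = 0.8

    # Static evaluation: score each turn in isolation, from the empty state.
    static_safe = all(
        toy_score(transition(DialogueState(), t)) <= threshold for t in turns
    )
    print("every turn passes in isolation:", static_safe)

    # Trajectory evaluation: score the accumulated state after each turn.
    print("first crossing at turn:", first_crossing(turns, toy_score, threshold))
```

In this toy setup each turn is below the threshold when scored alone, yet the accumulated state exceeds it by the third turn, which mirrors the static-versus-multi-turn gap listed under Findings.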
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
