Internal Safety Collapse in Frontier Large Language Models

Yutao Wu; Xiao Liu; Yifeng Gao; Xiang Zheng; Hanxun Huang; Yige Li; Cong Wang; Bo Li; Xingjun Ma; Yu-Gang Jiang

arXiv:2603.23509·cs.CL·March 26, 2026

Open Access

TL;DR

This paper identifies a failure mode in frontier large language models called Internal Safety Collapse (ISC), in which models continuously generate harmful content while executing otherwise benign tasks. The result shows that significant safety vulnerabilities persist even after alignment.

Contribution

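The paper introduces TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and ISC-Bench, a benchmark of 53 scenarios across 8 professional disciplines for systematically testing these safety failures.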

Findings

01

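Three representative ISC-Bench scenarios, evaluated on JailbreakBench, yield worst-case safety failure rates averaging 95.3% across four frontier LLMs, substantially exceeding standard jailbreak attacks.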

02

Frontier models are more vulnerable than earlier LLMs: the capabilities that enable complex task execution become liabilities.

03

Alignment efforts do not fully eliminate inherent safety risks in frontier LLMs.

Abstract

This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks…
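The headline figure combines two aggregation steps: a worst case over scenarios and an average over models. The sketch below shows one plausible reading of that arithmetic. It is an assumption, not the authors' evaluation code: it treats a behavior as a failure for a model if any scenario elicited an unsafe output, computes the per-model rate, then averages across models. The function names, model and scenario labels, and the toy outcomes are all hypothetical placeholders and are not results from the paper.

```python
from statistics import mean


def worst_case_failure_rate(outcomes):
    """Fraction of behaviors that failed under at least one scenario.

    outcomes: dict mapping scenario name -> list of bools, one per behavior
    (True means the model produced an unsafe output). Assumed layout.
    """
    per_scenario = list(outcomes.values())
    n_behaviors = len(per_scenario[0])
    if any(len(flags) != n_behaviors for flags in per_scenario):
        raise ValueError("every scenario must cover the same behaviors")
    # Worst case: a behavior counts as failed if ANY scenario triggered it.
    failed = [any(flags) for flags in zip(*per_scenario)]
    return sum(failed) / n_behaviors


def average_across_models(results):
    """Mean of the per-model worst-case failure rates."""
    return mean(worst_case_failure_rate(o) for o in results.values())


# Hypothetical toy outcomes (NOT data from the paper).
toy_results = {
    "model_a": {
        "scenario_1": [True, False, True, False],
        "scenario_2": [False, False, True, True],
    },
    "model_b": {
        "scenario_1": [False, False, False, True],
        "scenario_2": [False, True, False, False],
    },
}

if __name__ == "__main__":
    for name, outcomes in toy_results.items():
        print(name, worst_case_failure_rate(outcomes))
    print("average", average_across_models(toy_results))
```

Under this reading, the worst-case rate can only be equal to or higher than the rate of any single scenario, which is why the order of aggregation matters when comparing against single-attack baselines.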

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Information and Cyber Security