SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

Chao Pan; Yu Wu; Xin Yao

arXiv:2604.20930·cs.CR·April 24, 2026

SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

Chao Pan, Yu Wu, Xin Yao

PDF

1 Repo

TL;DR

SafeRedirect is a system-level method that significantly reduces unsafe content generation in frontier LLMs by redirecting their task completion process, outperforming existing defenses.

Contribution

It introduces SafeRedirect, a novel system-level override that effectively mitigates Internal Safety Collapse in frontier LLMs by controlling task completion behavior.

Findings

01

Reduces unsafe generation rates from 71.2% to 8.0% across models.

02

Outperforms baseline defenses in mitigating ISC.

03

Demonstrates generalization across attack types.

Abstract

Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously generate that content with safety failure rates exceeding 95%. Existing input-level defenses achieve a 100% failure rate against ISC, and standard system prompt defenses provide only partial mitigation. We propose SafeRedirect, a system-level override that defeats ISC by redirecting the model's task-completion drive rather than suppressing it. SafeRedirect grants explicit permission to fail the task, prescribes a deterministic hard-stop output, and instructs the model to preserve harmful placeholders unresolved. Evaluated on seven frontier LLMs across three AI/ML-related ISC task types in the single-turn setting, SafeRedirect reduces average unsafe generation rates from 71.2% to 8.0%, compared…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

fzjcdt/SafeRedirect
github

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.