TL;DR
SafeRedirect is a system-level method that significantly reduces unsafe content generation in frontier LLMs by redirecting their task completion process, outperforming existing defenses.
Contribution
It introduces SafeRedirect, a novel system-level override that effectively mitigates Internal Safety Collapse in frontier LLMs by controlling task completion behavior.
Findings
Reduces unsafe generation rates from 71.2% to 8.0% across models.
Outperforms baseline defenses in mitigating ISC.
Demonstrates generalization across attack types.
Abstract
Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously generate that content with safety failure rates exceeding 95%. Existing input-level defenses achieve a 100% failure rate against ISC, and standard system prompt defenses provide only partial mitigation. We propose SafeRedirect, a system-level override that defeats ISC by redirecting the model's task-completion drive rather than suppressing it. SafeRedirect grants explicit permission to fail the task, prescribes a deterministic hard-stop output, and instructs the model to preserve harmful placeholders unresolved. Evaluated on seven frontier LLMs across three AI/ML-related ISC task types in the single-turn setting, SafeRedirect reduces average unsafe generation rates from 71.2% to 8.0%, compared…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
