Safety Context Injection: Inference-Time Safety Alignment via Static Filtering and Agentic Analysis
Zhenhao Xu, Wenhan Chang, Yichuan Chen, Yuxin Fang, Junhao Liu, Tianqing Zhu

TL;DR
This paper introduces Safety Context Injection (SCI), an inference-time safety alignment framework for large reasoning models that injects an external safety report into the model's context, improving safety without modifying model weights.
Contribution
The paper proposes SCI with two variants, Static Model Filtering (SMF) and Dynamic Agents Filtering (DAF), to enhance safety at inference time for black-box models.
Findings
Both variants reduce attack success rate and toxicity.
SMF provides a fast, low-latency safety guard.
DAF is more effective against disguised or dispersed harmful intent.
Abstract
Large Reasoning Models (LRMs) improve performance on complex tasks, but they also make safety control harder at deployment time. In black-box settings, defenders cannot modify model weights and must instead intervene at inference time. This setting creates three practical challenges: harmful intent may be hidden by educational or role-play framing, deep safety analysis can introduce non-trivial latency, and long adversarial contexts can dilute the local cues that simpler filters rely on. These challenges can expose an apparent thinking-output gap, where the model appears cautious during reasoning but still produces an unsafe final answer. To address this problem, we propose Safety Context Injection (SCI), an inference-time framework that separates safety assessment from task generation and prepends a structured external risk report as injected safety context for the protected model.…
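The abstract's core idea, assessing risk separately and then prepending a structured report to the protected model's input, can be illustrated with a minimal sketch. This is not the authors' implementation: the cue list, the `RiskReport` fields, the report delimiters, and the function names `assess` and `inject_safety_context` are all hypothetical, and a simple substring match stands in for whatever filter model or agents the paper actually uses.

```python
from dataclasses import dataclass, field

# Hypothetical cue list (assumption); a substring match stands in for a real
# safety classifier or agentic analysis.
RISK_CUES = ("build a bomb", "bypass authentication", "steal credentials")


@dataclass
class RiskReport:
    """Structured result of the safety assessment step (illustrative fields)."""
    risk_level: str
    matched_cues: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Serialize the report as plain text so it can be prepended to a prompt.
        cues = ", ".join(self.matched_cues) or "none"
        return (
            "[SAFETY REPORT]\n"
            f"risk_level: {self.risk_level}\n"
            f"matched_cues: {cues}\n"
            "[END SAFETY REPORT]"
        )


def assess(prompt: str) -> RiskReport:
    """Safety assessment, kept separate from task generation."""
    lowered = prompt.lower()
    matched = [cue for cue in RISK_CUES if cue in lowered]
    return RiskReport("high" if matched else "low", matched)


def inject_safety_context(prompt: str) -> str:
    """Prepend the rendered report; the protected model receives this text."""
    return f"{assess(prompt).render()}\n\n{prompt}"
```

The protected model is never modified here: it simply receives the augmented prompt returned by `inject_safety_context`, which is what makes the approach usable in a black-box setting. Note that a cue-based filter like this one is exactly the kind of simple filter the abstract says long adversarial contexts can defeat.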
