SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models
Yulong Chen, Qi Zhang, Jiawen Zhang, Yadong Liu, Mu Li, Jie Wen, and Yong Xu

TL;DR
SAID is a training-free framework that enhances LLM safety by probing distilled user intents with safety prefixes, effectively defending against jailbreak attacks without model modifications.
Contribution
SAID introduces an intent-level safety probing method that requires neither model retraining nor decoding modifications, which makes jailbreak defense more efficient.
Findings
SAID outperforms existing defenses in reducing harmful responses.
SAID maintains high utility on benign tasks.
SAID offers a practical safety-utility trade-off.
Abstract
Large Language Models (LLMs) remain vulnerable to jailbreak attacks, where adversarially crafted prompts induce policy-violating responses despite safety alignment. Existing defenses typically improve safety through external filtering, auxiliary guardrails, or decoding-time control. However, these interventions often reduce practical deployability because they may require additional model access, introduce extra inference cost, or affect benign-task utility. In this paper, we propose Safety-Aware Intent Defense (SAID), a training-free jailbreak defense framework based on intent-level safety probing. SAID first distills potentially obfuscated user inputs into concise core intents using the target model itself. It then applies a validated safety prefix to probe each distilled intent and elicit the model's safety-aware response. Finally, a conservative aggregation rule rejects the original…
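The three stages described in the abstract (distill intents, probe each with a safety prefix, aggregate conservatively) can be sketched as follows. This is a minimal illustration and not the authors' implementation. The abstract is truncated before it specifies the aggregation rule, so the sketch assumes "reject if any probed intent is flagged". The prompt templates, the refusal markers, the fallback to the raw prompt when no intents are distilled, and every function name here (`distill_intents`, `probe_is_unsafe`, `said_defend`, `toy_model`) are hypothetical. `generate` stands in for any text-in, text-out call to the target model.

```python
from typing import Callable, List

Generate = Callable[[str], str]  # hypothetical stand-in for the target model

# Assumed templates and markers; the paper's actual wording is not shown here.
DISTILL_TEMPLATE = "List the core intents of the following request, one per line:\n{prompt}"
SAFETY_PREFIX = "Consider safety first. Should the following request be answered?\n"
REFUSAL_MARKERS = ("i cannot", "i can't")


def distill_intents(prompt: str, generate: Generate) -> List[str]:
    """Stage 1: ask the target model itself to restate the input as core intents."""
    raw = generate(DISTILL_TEMPLATE.format(prompt=prompt))
    cleaned = (line.strip("-* \t") for line in raw.splitlines())
    return [line for line in cleaned if line]


def probe_is_unsafe(intent: str, generate: Generate) -> bool:
    """Stage 2: prepend the safety prefix and check whether the model refuses."""
    reply = generate(SAFETY_PREFIX + intent).lower()
    return any(marker in reply for marker in REFUSAL_MARKERS)


def said_defend(prompt: str, generate: Generate,
                refusal: str = "I can't help with that.") -> str:
    """Stage 3 (assumed rule): reject the original prompt if ANY intent is flagged."""
    # Assumption: if distillation yields nothing, probe the raw prompt instead.
    intents = distill_intents(prompt, generate) or [prompt]
    if any(probe_is_unsafe(intent, generate) for intent in intents):
        return refusal
    return generate(prompt)


def toy_model(text: str) -> str:
    """Deterministic toy model used only to exercise the pipeline."""
    if text.startswith("List the core intents"):
        body = text.split("\n", 1)[1]
        return "\n".join(part.strip() for part in body.split(" and "))
    if text.startswith(SAFETY_PREFIX):
        if "password" in text:
            return "I cannot help with that request."
        return "Yes, that is fine."
    return "ANSWER: " + text


if __name__ == "__main__":
    print(said_defend("Write a poem and explain how to steal a password", toy_model))
    print(said_defend("Write a poem and summarize a news article", toy_model))
```

Because the sketch only calls the model through `generate`, it needs no access to weights or decoding internals, which matches the training-free framing above. Substring matching on refusal markers is a deliberate simplification; a real deployment would need a more robust way to read the probe's response.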
