SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment

Xianya Fang; Xianying Luo; Yadong Wang; Xiang Chen; Yu Tian; Zequn Sun; Rui Liu; Jun Fang; Naiqiang Tan; Yuanning Cui; Sheng-Jun Huang

arXiv:2601.16506·cs.CR·January 26, 2026

TL;DR

SafeThinker is an adaptive framework that enhances LLM safety by dynamically assessing risk and routing inputs through specialized mechanisms to prevent disguised attacks while maintaining utility.

Contribution

The paper introduces a risk-aware framework that allocates defensive resources through a lightweight gateway classifier and multiple defense modules to improve LLM safety.

Findings

01 Reduces attack success rates across various jailbreak strategies

02 Maintains high utility while improving safety

03 Effectively balances robustness and practicality

Abstract

Despite the intrinsic risk-awareness of Large Language Models (LLMs), current defenses often result in shallow safety alignment, rendering models vulnerable to disguised attacks (e.g., prefilling) while degrading utility. To bridge this gap, we propose SafeThinker, an adaptive framework that dynamically allocates defensive resources via a lightweight gateway classifier. Based on the gateway's risk assessment, inputs are routed through three distinct mechanisms: (i) a Standardized Refusal Mechanism for explicit threats to maximize efficiency; (ii) a Safety-Aware Twin Expert (SATE) module to intercept deceptive attacks masquerading as benign queries; and (iii) a Distribution-Guided Think (DDGT) component that adaptively intervenes during uncertain generation. Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without…
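The routing idea in the abstract can be illustrated with a minimal Python sketch. This is one possible reading, not the authors' implementation: the abstract does not say how the gateway's output maps to the three mechanisms, so the scalar risk score, the two thresholds, and the names `Route`, `Thresholds`, and `route` are all hypothetical. The sketch assumes that high-risk inputs go to standardized refusal, inputs that look benign go to SATE for a second check, and everything in between goes to DDGT.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Route(Enum):
    """The three mechanisms named in the abstract."""
    REFUSAL = "standardized_refusal"
    SATE = "safety_aware_twin_expert"
    DDGT = "distribution_guided_think"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs; the abstract gives no numeric values.
    explicit: float = 0.9  # at or above: treated as an explicit threat
    benign: float = 0.1    # at or below: looks benign, still checked by SATE

    def __post_init__(self) -> None:
        if not 0.0 <= self.benign < self.explicit <= 1.0:
            raise ValueError("require 0 <= benign < explicit <= 1")


def route(
    prompt: str,
    gateway: Callable[[str], float],
    thresholds: Thresholds = Thresholds(),
) -> Route:
    """Pick a mechanism from a gateway risk score in [0, 1].

    `gateway` is a stand-in for the lightweight gateway classifier;
    any callable that maps a prompt to a risk score will do here.
    """
    score = gateway(prompt)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"gateway score out of range: {score}")
    if score >= thresholds.explicit:
        return Route.REFUSAL  # explicit threat: refuse cheaply
    if score <= thresholds.benign:
        return Route.SATE     # apparently benign: screen for disguised attacks
    return Route.DDGT         # uncertain: intervene during generation
```

For example, `route("some prompt", lambda p: 0.5)` returns `Route.DDGT` under the default thresholds. Passing the gateway as a callable keeps the routing logic independent of whichever classifier produces the score.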

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling