Safety Reasoning with Guidelines
Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Dacheng Tao, Minhao Cheng

TL;DR
This paper investigates the limitations of Refusal Training in safe language models, revealing its reliance on superficial shortcuts, and proposes a reasoning-based training approach with guidelines to enhance OOD safety generalization.
Contribution
The paper introduces a safety reasoning training method with synthesized supervision aligned with safety guidelines, improving model robustness against OOD attacks.
Findings
Best-of-N (BoN) evaluations show safety improving as N increases
Refusal Training (RT) relies on superficial shortcuts, limiting generalization
Proposed reasoning approach enhances OOD safety robustness
Abstract
Training safe LLMs remains a critical challenge. The most widely used method, Refusal Training (RT), struggles to generalize against various Out-of-Distribution (OOD) jailbreaking attacks. Although various advanced methods have been proposed to address this issue, we instead question whether OOD attacks inherently surpass the capability of vanilla RT. Evaluations using Best-of-N (BoN) reveal significant safety improvements as N increases, indicating that models possess adequate latent safety knowledge but RT fails to consistently elicit it under OOD scenarios. Further domain adaptation analysis reveals that direct RT causes reliance on superficial shortcuts, resulting in non-generalizable representation mappings. Inspired by our findings, we propose training the model to perform safety reasoning for each query. Specifically, we synthesize reasoning supervision aligned with specified guidelines…
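The Best-of-N evaluation mentioned in the abstract can be illustrated with a small sketch. It assumes, as a reading of the abstract rather than a confirmed detail of the paper's protocol, that an attack prompt counts as defended when at least one of N sampled responses is judged safe. The names `best_of_n_safe_rate`, `make_scripted_sampler`, and the toy judge are hypothetical; a real setup would sample from the model under test and use a safety classifier as the judge.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def best_of_n_safe_rate(
    prompts: Sequence[str],
    sample: Callable[[str], str],
    is_safe: Callable[[str, str], bool],
    n: int,
) -> float:
    """Fraction of prompts with at least one safe response among up to n samples.

    Assumption: "defended" means any one of the n samples is judged safe.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not prompts:
        raise ValueError("prompts must be non-empty")
    defended = 0
    for prompt in prompts:
        # any() stops sampling as soon as one safe response is found.
        if any(is_safe(prompt, sample(prompt)) for _ in range(n)):
            defended += 1
    return defended / len(prompts)


def make_scripted_sampler(scripts: Dict[str, List[str]]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replays fixed responses per prompt."""
    iterators: Dict[str, Iterator[str]] = {p: iter(r) for p, r in scripts.items()}
    return lambda prompt: next(iterators[prompt])


def toy_judge(prompt: str, response: str) -> bool:
    """Hypothetical stand-in for a safety judge: only refusals count as safe."""
    return response == "refuse"


# Toy data: attack_a yields a refusal on its third sample, attack_b never does.
SCRIPTS = {
    "attack_a": ["comply", "comply", "refuse"],
    "attack_b": ["comply", "comply", "comply"],
}

if __name__ == "__main__":
    for n in (1, 3):
        rate = best_of_n_safe_rate(
            list(SCRIPTS), make_scripted_sampler(SCRIPTS), toy_judge, n
        )
        print(f"N={n}: safe rate {rate:.2f}")
```

In this toy data the safe rate rises with N because one prompt occasionally elicits a refusal. That mirrors, in miniature, the abstract's point that the safe behavior is present in the model but is not elicited consistently.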
Taxonomy
Topics: Occupational Health and Safety Research · Risk and Safety Analysis · Software Reliability and Analysis Research
