Safety Reasoning with Guidelines

Haoyu Wang; Zeyu Qin; Li Shen; Xueqian Wang; Dacheng Tao; Minhao Cheng

arXiv:2502.04040·cs.LG·June 2, 2025

TL;DR

This paper investigates why Refusal Training (RT) fails to generalize to Out-of-Distribution (OOD) jailbreaking attacks, shows that RT relies on superficial shortcuts, and proposes training models to reason about safety under specified guidelines to improve OOD safety generalization.

Contribution

The paper introduces a safety reasoning training method with synthesized supervision aligned with safety guidelines, improving model robustness against OOD attacks.

Findings

01 BoN evaluations show increased safety with larger N

02 RT relies on superficial shortcuts, limiting generalization

03 Proposed reasoning approach enhances OOD safety robustness

Abstract

Training safe LLMs remains a critical challenge. The most widely used method, Refusal Training (RT), struggles to generalize against various Out-of-Distribution (OOD) jailbreaking attacks. Although many advanced methods have been proposed to address this issue, we instead question whether OOD attacks inherently surpass the capability of vanilla RT. Evaluations using Best-of-N (BoN) reveal significant safety improvements as N increases, indicating that models possess adequate latent safety knowledge but that RT fails to consistently elicit it under OOD scenarios. Further domain adaptation analysis reveals that direct RT causes reliance on superficial shortcuts, resulting in non-generalizable representation mappings. Inspired by our findings, we propose training models to perform safety reasoning for each query. Specifically, we synthesize reasoning supervision aligned with specified guidelines…
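The Best-of-N evaluation mentioned in the abstract can be illustrated with a minimal sketch. This excerpt does not state the paper's exact protocol, so the code assumes one common reading: for each attack prompt, N responses are sampled, and the attack counts as successful only if a safety judge flags all N as unsafe. The names `best_of_n_attack_success_rate`, `make_toy_generator`, and `toy_is_safe` are hypothetical stand-ins for a real model sampler and a real safety judge.

```python
import itertools
from typing import Callable, Sequence


def best_of_n_attack_success_rate(
    prompts: Sequence[str],
    generate: Callable[[str], str],
    is_safe: Callable[[str, str], bool],
    n: int,
) -> float:
    """Fraction of prompts for which all n sampled responses are unsafe.

    Assumption (not confirmed by the source): an attack is counted as
    failed as soon as at least one of the n samples is judged safe.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not prompts:
        raise ValueError("prompts must not be empty")

    successful_attacks = 0
    for prompt in prompts:
        # Draw all n samples first so every prompt uses the same budget.
        responses = [generate(prompt) for _ in range(n)]
        if not any(is_safe(prompt, response) for response in responses):
            successful_attacks += 1
    return successful_attacks / len(prompts)


def make_toy_generator(period: int) -> Callable[[str], str]:
    """Hypothetical deterministic stand-in for a sampled model.

    Returns "REFUSE" on every period-th call and "COMPLY" otherwise.
    """
    counter = itertools.count()

    def generate(prompt: str) -> str:
        return "REFUSE" if next(counter) % period == period - 1 else "COMPLY"

    return generate


def toy_is_safe(prompt: str, response: str) -> bool:
    """Hypothetical judge: only an explicit refusal counts as safe."""
    return response == "REFUSE"
```

With a real model, `generate` would sample with nonzero temperature and `is_safe` would call a safety classifier. Under this reading, a rate that falls as `n` grows suggests that safe responses already exist in the model's output distribution, which matches the abstract's claim about latent safety knowledge.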

Taxonomy

Topics: Occupational Health and Safety Research · Risk and Safety Analysis · Software Reliability and Analysis Research