Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang; Yanxu Zhu; Dongyuan Lu; Jitao Sang

arXiv:2511.21214 · cs.CL · January 6, 2026

TL;DR

This paper presents SGASA, a novel framework that enhances reasoning models' safety by internalizing synthesized safety guidelines, enabling adaptive defense against adversarial prompts while maintaining performance on benign tasks.

Contribution

The paper introduces SGASA, a new adaptive safety alignment method that synthesizes safety guidelines and fine-tunes models to improve robustness against adversarial prompts.

Findings

01. SGASA significantly reduces harmful outputs in adversarial scenarios.

02. The framework maintains high performance on benign tasks.

03. Experiments validate SGASA's scalability and effectiveness.

Abstract

Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Because such prompts are covert and deceptive, they can often evade built-in safety mechanisms and lead the model to generate harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling