SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding; Wen Sun; Dailin Li; Wei Zou; Jiaming Wang; Jiajun Chen; Shujian Huang

arXiv:2508.15648·cs.CL·August 27, 2025

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang

PDF

Open Access 1 Video

TL;DR

SDGO introduces a reinforcement learning framework that aligns a large language model's discrimination and generation abilities, significantly improving safety against jailbreaking attacks without needing extra annotated data.

Contribution

The paper presents SDGO, a novel self-guided optimization method that enhances LLM safety by leveraging its own discrimination capabilities during training.

Findings

01

SDGO improves safety against jailbreaking attacks.

02

It maintains helpfulness on general benchmarks.

03

Requires no additional annotated data during training.

Abstract

Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models· underline

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection