Safety Alignment Can Be Not Superficial With Explicit Safety Signals

Jianwei Li; Jung-Eun Kim

arXiv:2505.17072·cs.CR·June 2, 2025

TL;DR

This paper proposes an explicit safety signal approach for large language models, significantly enhancing their robustness against adversarial attacks by integrating safety classification into the generation process.

Contribution

The paper introduces a safety-related binary classification task, together with attention and decoding strategies, to reduce superficial safety alignment and improve adversarial robustness.
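As a rough illustration of what an explicit binary safety signal could look like at decoding time, the sketch below scores a request with a small logistic classifier and uses that score to choose between refusing and generating. It is not the authors' implementation: the names `safety_probability`, `guarded_generate`, and `REFUSAL`, the logistic form, the feature vector, and the 0.5 threshold are all assumptions made for illustration, and the paper's attention and decoding strategies are not reproduced here.

```python
import math
from typing import Callable, Sequence

# Hypothetical refusal text; not taken from the paper.
REFUSAL = "I can't help with that request."


def safety_probability(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Logistic binary classifier: estimated probability that a request is harmful.

    `features` stands in for whatever representation the model exposes
    (an assumption; the summary does not specify it).
    """
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def guarded_generate(
    features: Sequence[float],
    weights: Sequence[float],
    bias: float,
    generate: Callable[[], str],
    threshold: float = 0.5,  # assumed decision threshold
) -> str:
    """Make the safety decision explicitly before generating a response."""
    p_harmful = safety_probability(features, weights, bias)
    if p_harmful >= threshold:
        return REFUSAL
    return generate()
```

The point of the sketch is only that the safe/unsafe decision is computed as its own signal and then conditions generation, rather than being left implicit in next-token prediction.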

Findings

01. Enhanced safety robustness against adversarial attacks

02. Minimal overhead of less than 0.2x in computational cost

03. Significant improvement in safety decision boundaries

Abstract

Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle with drawing a firm safety-conscious decision boundary when confronted with adversarial attacks. Based on this observation, by explicitly introducing a safety-related binary…

Videos

Safety Alignment Can Be Not Superficial With Explicit Safety Signals (SlidesLive)

Taxonomy

Methods: Softmax · Attention Is All You Need