TL;DR
This paper proposes an explicit safety-signal approach for large language models: it integrates safety classification into the generation process, significantly improving robustness against adversarial attacks.
Contribution
It introduces a safety-related binary classification task with attention and decoding strategies, reducing superficial safety alignment and improving adversarial robustness.
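To make the idea of an explicit safety signal concrete, here is a minimal sketch of a binary safety decision gating generation. It is not the paper's implementation: the keyword-based `toy_safety_classifier`, the `canned_model` token source, the `check_every` interval, and the refusal string are all hypothetical stand-ins, and the paper's attention and decoding strategies are not modeled.

```python
from typing import Callable, List, Optional

REFUSAL = "I can't help with that."  # hypothetical refusal message


def toy_safety_classifier(text: str) -> bool:
    """Hypothetical stand-in for a learned binary classifier; True means unsafe."""
    blocked = ("build a bomb", "steal credentials")
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocked)


def canned_model(reply: str) -> Callable[[str, List[str]], Optional[str]]:
    """Toy token source that replays a fixed reply one word at a time."""
    words = reply.split()

    def next_token(prompt: str, tokens: List[str]) -> Optional[str]:
        return words[len(tokens)] if len(tokens) < len(words) else None

    return next_token


def generate_with_safety_signal(
    prompt: str,
    next_token: Callable[[str, List[str]], Optional[str]],
    is_unsafe: Callable[[str], bool],
    max_tokens: int = 20,
    check_every: int = 4,  # assumed re-check interval, chosen for illustration
) -> str:
    # Explicit binary decision on the prompt before any token is produced.
    if is_unsafe(prompt):
        return REFUSAL
    tokens: List[str] = []
    for step in range(max_tokens):
        token = next_token(prompt, tokens)
        if token is None:
            break
        tokens.append(token)
        # Re-evaluate the safety signal periodically on prompt plus partial output.
        if (step + 1) % check_every == 0 and is_unsafe(prompt + " " + " ".join(tokens)):
            return REFUSAL
    return " ".join(tokens)
```

In this sketch the classifier runs once on the prompt and again every few tokens, so a response that drifts into unsafe content mid-generation can still be cut off. A real system would replace the keyword match with a trained classifier.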
Findings
Enhanced safety robustness against adversarial attacks
Low overhead: less than 0.2x additional computational cost
Significant improvement in safety decision boundaries
Abstract
Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle with drawing a firm safety-conscious decision boundary when confronted with adversarial attacks. Based on this observation, by explicitly introducing a safety-related binary…
Taxonomy
Methods: Softmax · Attention Is All You Need
