TL;DR
AlphaAlign introduces a simple reinforcement learning framework that enhances safety reasoning in large language models, effectively balancing harmful content refusal and utility preservation without extensive supervision.
Contribution
It presents a minimalistic RL approach with verifiable safety rewards that promotes proactive safety reasoning, surpassing traditional methods in efficiency and effectiveness.
Findings
Improves safety refusal accuracy with minimal RL steps.
Reduces over-refusals while maintaining task performance.
Fosters explicit safety reasoning and rationales.
Abstract
Large language models (LLMs), despite possessing latent safety understanding from their vast pretraining data, remain vulnerable to generating harmful content and exhibit issues such as over-refusal and utility degradation after safety alignment. Current safety alignment methods often result in superficial refusal shortcuts or rely on intensive supervision for reasoning-based approaches, failing to fully leverage the model's intrinsic safety self-awareness. We propose AlphaAlign, a simple yet effective pure reinforcement learning (RL) framework with verifiable safety reward designed to incentivize this latent safety awareness through proactive safety reasoning. AlphaAlign employs a dual-reward system: a verifiable safety reward encourages correctly formatted and explicitly justified refusals for harmful queries while penalizing over-refusals, and a normalized helpfulness…
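The verifiable safety reward described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `<think>...</think>` output format, the refusal marker phrases, the reward values, and the names `split_response`, `is_refusal`, and `safety_reward` are all assumptions chosen for illustration. The only inputs are the model response and a binary harmful/benign label for the prompt. The helpfulness reward is not sketched here.

```python
import re

# Hypothetical refusal markers; the phrases used by the paper's verifier
# are not given in this summary.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am unable")

# Hypothetical output format: non-empty reasoning inside <think> tags,
# followed by a non-empty final answer.
FORMAT_RE = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def split_response(response):
    """Return (reasoning, answer), or None if the format is violated."""
    match = FORMAT_RE.match(response.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def is_refusal(answer):
    """Rule-based check: does the answer contain any refusal marker?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def safety_reward(response, prompt_is_harmful):
    """Assumed reward values: +1 for a well-formed refusal of a harmful
    prompt, -1 for a format violation, a missed refusal, or an
    over-refusal of a benign prompt, and 0 for a benign prompt that is
    answered (its quality would be scored by a separate helpfulness term).
    """
    parts = split_response(response)
    if parts is None:
        return -1.0  # malformed output is penalized regardless of label
    _, answer = parts
    refused = is_refusal(answer)
    if prompt_is_harmful:
        return 1.0 if refused else -1.0
    return -1.0 if refused else 0.0
```

For example, `safety_reward("<think>This request is dangerous.</think> I'm sorry, but I cannot help with that.", True)` evaluates to `1.0`, while the same refusal given for a benign prompt evaluates to `-1.0`. Because the check is plain substring matching, a refusal phrased without any listed marker would be scored as a non-refusal, so the marker list largely determines how reliable such a reward is.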
Peer Reviews
Decision·ICLR 2026 Poster
Originality: The paper presents a fundamentally original approach by reframing safety alignment from an external constraint problem to an internal capability activation problem. The adaptation of Reinforcement Learning with Verifiable Rewards (RLVR) to safety alignment is highly creative.
Quality: The evaluation is rigorous and comprehensive, encompassing multiple model architectures, diverse safety benchmarks, utility preservation metrics, and thorough ablation studies.
Clarity: The paper offe
1) Evaluation Metric Limitation
- Heavy emphasis on ASR may miss other important safety dimensions, such as consistency between reasoning and the final answer.
- The model is rewarded for generating these exact phrases during training, then evaluated based on detecting these same phrases. This doesn't validate genuine safety understanding.
2) Insufficient Analysis of "Deep Alignment"
The paper claims to "incentivize safety awareness" but doesn't clearly distinguish this from simply training on
- The primary strength of this paper is its simplicity and novelty. The proposal to use a pure RL framework with a simple, verifiable reward system, bypassing the need for expensive supervised safety-reasoning datasets, is a valuable contribution to the alignment field. The data-efficiency (requiring only binary prompt labels) is a significant practical advantage.
- The method directly addresses the critical safety-utility trade-off. The dual-reward system is well-motivated, and the experimental
- **The reliance on a simple, rule-based refusal verifier ($V_r$).** As described in Appendix B.1 and mentioned on Line 207, $V_r$ appears to be a simple string-matching function. This is brittle and prone to both false positives and false negatives. The paper needs empirical validation of this verifier's accuracy and its impact on the diversity of refusal responses.
- **The formalism of the reward functions lacks clarity, hindering reproducibility.** In Equation 5, the safety reward $R_s$ is de
The idea of leveraging CoT to enhance LLMs' safety reasoning is novel and promising. The public release of the source code is a commendable practice that significantly facilitates reproducibility and future research in this area.
+ The paper lacks convincing evidence to demonstrate that AlphaAlign achieves a deeper alignment beyond a superficial or shallow safety adjustment.
+ The claim on line 81 that "fewer than 200 RL steps are sufficient to yield substantial improvements" is not sufficiently substantiated.
+ The design choices for the reward function and the RL loss are confusing.
