Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization
Yifan Niu, Han Xiao, Dongyi Liu, Nuo Chen, Jia Li

TL;DR
This paper introduces Null-Space constrained Policy Optimization (NSPO), a reinforcement learning framework that enhances safety alignment in large language models while preserving their core abilities, using geometric projections to mitigate the safety alignment tax.
Contribution
The paper proposes NSPO, a novel RL method that preserves language models' abilities during safety alignment by projecting safety policy gradients into the null space of general tasks.
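The projection step can be illustrated with a small NumPy sketch. This is not the authors' implementation: the layout of `K` (one column per general-task feature vector), the eigenvalue tolerance `tol`, the helper name `null_space_projector`, and the random stand-in gradient `G` are all assumptions made for illustration. The name `U_hat` echoes the $\hat{U}\hat{U}^\top$ term mentioned in the reviews, but how the paper actually builds it is not stated on this page.

```python
import numpy as np

def null_space_projector(K, tol=1e-6):
    """Return P = U_hat @ U_hat.T, the orthogonal projector onto the
    left null space of K (all vectors u with u.T @ K == 0).

    K: (d, n) matrix; each column is a feature vector from a general task
       (hypothetical layout, chosen for this sketch).
    """
    # K @ K.T is symmetric positive semi-definite, so eigh applies.
    eigvals, eigvecs = np.linalg.eigh(K @ K.T)
    # Keep eigenvectors whose eigenvalues are numerically zero
    # relative to the largest eigenvalue.
    mask = eigvals <= tol * eigvals.max()
    U_hat = eigvecs[:, mask]
    return U_hat @ U_hat.T

rng = np.random.default_rng(0)
d, n, r = 16, 100, 6
# Synthetic rank-r feature matrix, so a (d - r)-dimensional null space exists.
K = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))
P = null_space_projector(K)

# Stand-in for a safety gradient on a weight matrix W of shape (4, d).
G = rng.standard_normal((4, d))
G_proj = G @ P

# An update along G_proj leaves W @ K unchanged, because G_proj @ K is ~0.
residual = np.abs(G_proj @ K).max()
print(f"null-space dim: {round(np.trace(P))}, max |G_proj @ K|: {residual:.2e}")
```

In this toy setting the projected gradient has no effect on the layer's outputs for the general-task features, which is the sense in which a null-space update can leave existing behavior untouched.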
Findings
NSPO outperforms existing safety alignment methods.
Achieves state-of-the-art safety performance on multiple tasks.
Requires only 40% of safety data for effective alignment.
Abstract
As Large Language Models (LLMs) are increasingly deployed in real-world applications, it is important to ensure their behaviors align with human values, societal norms, and ethical principles. However, safety alignment under Reinforcement Learning (RL) often suffers from forgetting learned general abilities, which is also known as the alignment tax. To address this issue, we introduce Null-Space constrained Policy Optimization (NSPO), a novel RL framework for LLM safety alignment while preserving their core abilities. The safety policy gradients are geometrically projected into the null space of general tasks, thereby mitigating the safety alignment tax. In addition, we theoretically prove that NSPO preserves the model's original core capabilities, while still guaranteeing a descent direction for effective safety alignment. Extensive experiments demonstrate that NSPO outperforms…
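The two claims in the abstract (capabilities are preserved, and the projected gradient is still a descent direction) can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration, not the paper's training loop: a quadratic loss stands in for the safety objective, and `K`, `X`, `Y`, the step size, and the iteration count are all made up. The descent property follows from a standard fact: for an orthogonal projector $P$ (symmetric, $P^2 = P$), $\langle G, GP \rangle = \|GP\|^2 \ge 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 12, 4
# Hypothetical general-task features K (d x n) of rank r.
K = rng.standard_normal((d, r)) @ rng.standard_normal((r, 50))
# Left singular vectors beyond rank r span the left null space of K.
U, s, _ = np.linalg.svd(K, full_matrices=True)
U_hat = U[:, r:]
P = U_hat @ U_hat.T

# Toy stand-in for a safety loss: L(W) = 0.5 * ||W X - Y||^2.
X = rng.standard_normal((d, 20))
Y = rng.standard_normal((3, 20))
W0 = rng.standard_normal((3, d))

def loss(W):
    return 0.5 * np.sum((W @ X - Y) ** 2)

W = W0.copy()
losses = [loss(W)]
for _ in range(200):
    G = (W @ X - Y) @ X.T    # gradient of the toy loss w.r.t. W
    W = W - 1e-3 * (G @ P)   # step along the projected gradient
    losses.append(loss(W))

# Outputs on the general-task features should be unchanged.
drift = np.abs(W @ K - W0 @ K).max()
print(f"loss: {losses[0]:.2f} -> {losses[-1]:.2f}, drift on K: {drift:.2e}")
```

The toy loss decreases at every step while `W @ K` stays fixed up to floating-point error. The loss does not reach the unconstrained minimum, since only the component of the gradient inside the null space is used.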
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper addresses a critical and well-known problem in LLM alignment. The core idea of applying null-space projection to decouple safety and capability gradients is novel and creatively combines null-space projection with modern RLHF/GRPO frameworks. The paper provides theoretical guarantees that the projected gradient remains a descent direction for the safety objective and is stable, which strengthens the methodological contribution.
2. The experimental results sh
1. The effectiveness of the null-space projection depends heavily on the general capability matrix K, which is constructed from 1,000 samples drawn from the Alpaca dataset. The paper lacks a sensitivity analysis of how the selection, diversity, and domain of this data impact the results. It remains unclear whether NSPO's performance generalizes if K is computed from a different domain (e.g., math, code) or from a smaller sample that may not capture the full spectrum of general capabilities. What impa
+ This paper addresses a critical challenge in safety alignment, the performance trade-off often termed the "safety tax." The exploration of the NSPO method to mitigate this tax is a novel and timely contribution.
+ The paper provides a solid theoretical grounding for the NSPO method, establishing its formal validity.
+ The presentation of the algorithmic foundations of NSPO lacks clarity.
+ A primary concern regarding NSPO is its safety guarantee, particularly when facing explicitly harmful prompts.
+ I am skeptical of several counterintuitive results presented in the experiments. Their validity requires stronger justification beyond the provided code.
+ The paper lacks an experimental analysis of the key parameter, the representation dimension $d$.
- The proposed idea is sound, and the authors provide extensive theoretical derivations.
- The empirical results demonstrate competitive performance of the proposed method.
- The effect of the projection is not clearly demonstrated by the ablation study. A comparison should be provided between NSPO (w/ projection) and GRPO (w/o projection), where the latter covers (1) original GRPO and (2) GRPO w/o KL, using the modified Eq. (6) in the paper with $\hat{U}\hat{U}^\top$ replaced by $I$.
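The ablation requested in this review amounts to swapping the projector for the identity matrix. A minimal sketch of that switch follows, under stated assumptions: Eq. (6) is not reproduced on this page, so the update rule `step`, the feature matrix `K`, and the random gradient `G` are hypothetical stand-ins rather than the paper's formulation.

```python
import numpy as np

def step(W, G, P, lr=0.1):
    """One gradient step with the gradient right-multiplied by P.
    Passing P = I recovers an unprojected update (hypothetical form)."""
    return W - lr * (G @ P)

rng = np.random.default_rng(2)
d, r = 10, 3
# Synthetic rank-r general-task features.
K = rng.standard_normal((d, r)) @ rng.standard_normal((r, 40))
U, _, _ = np.linalg.svd(K, full_matrices=True)
U_hat = U[:, r:]

P_null = U_hat @ U_hat.T   # with projection
P_identity = np.eye(d)     # ablation: projector replaced by I

W = rng.standard_normal((2, d))
G = rng.standard_normal((2, d))  # stand-in for a safety policy gradient

# How much one update changes the outputs on general-task features.
drift_proj = np.linalg.norm(step(W, G, P_null) @ K - W @ K)
drift_plain = np.linalg.norm(step(W, G, P_identity) @ K - W @ K)
print(f"drift with projection: {drift_proj:.2e}, without: {drift_plain:.2e}")
```

On synthetic data the unprojected step visibly shifts the outputs on `K`, while the projected step does not. Whether the same gap appears in the actual GRPO setting is exactly what the requested ablation would need to show.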
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling
