Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization

Yifan Niu; Han Xiao; Dongyi Liu; Nuo Chen; Jia Li

arXiv:2512.11391 · cs.LG · February 2, 2026

Open Access 3 Reviews

TL;DR

This paper introduces Null-Space constrained Policy Optimization (NSPO), a reinforcement learning framework that enhances safety alignment in large language models while preserving their core abilities, using geometric projections to mitigate the safety alignment tax.

Contribution

The paper proposes NSPO, a novel RL method that preserves language models' abilities during safety alignment by projecting safety policy gradients into the null space of general tasks.
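The projection idea can be written out as a short derivation. This is a sketch under assumed notation, not the paper's exact formulation: $K$ and $\hat{U}\hat{U}^\top$ appear in the reviews of this paper, while the weight matrix $W$, the safety gradient $G$, the step size $\eta$, and the orthonormality of $\hat{U}$ are assumptions made here for illustration.

```latex
% Assumed notation (hypothetical, for illustration only):
%   W        weight matrix of one layer
%   G        safety policy gradient with respect to W
%   \eta > 0 step size
%   K        matrix whose columns are general-task inputs to the layer
%   \hat{U}  orthonormal basis with K^\top \hat{U} = 0
\begin{align}
P &= \hat{U}\hat{U}^\top, \qquad P K = \hat{U}\,(K^\top \hat{U})^\top = 0, \\
\Delta W &= -\eta\, G P, \\
(W + \Delta W)\, K &= W K - \eta\, G P K = W K, \\
\langle G,\; G P \rangle_F &= \operatorname{tr}\!\left(G P G^\top\right)
  = \lVert G P \rVert_F^2 \;\ge\; 0 .
\end{align}
```

Under these assumptions, the third line says the update leaves the layer's outputs on the general-task inputs unchanged, and the last line says the first-order change in the safety loss is $-\eta \lVert G P \rVert_F^2 \le 0$, so the projected step is a descent direction whenever $G P \neq 0$.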

Findings

01. NSPO outperforms existing safety alignment methods.

02. Achieves state-of-the-art safety performance on multiple tasks.

03. Requires only 40% of safety data for effective alignment.

Abstract

As Large Language Models (LLMs) are increasingly deployed in real-world applications, it is important to ensure their behaviors align with human values, societal norms, and ethical principles. However, safety alignment under Reinforcement Learning (RL) often suffers from forgetting learned general abilities, which is also known as the alignment tax. To address this issue, we introduce Null-Space constrained Policy Optimization (NSPO), a novel RL framework for LLM safety alignment while preserving their core abilities. The safety policy gradients are geometrically projected into the null space of general tasks, thereby mitigating the safety alignment tax. In addition, we theoretically prove that NSPO preserves the model's original core capabilities, while still guaranteeing a descent direction for effective safety alignment. Extensive experiments demonstrate that NSPO outperforms…
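A minimal numerical sketch of null-space gradient projection, as described in the abstract, may help make the geometry concrete. It is not the authors' implementation: the function name `null_space_projector`, the toy sizes, the random stand-ins for the general-task matrix `K` and the safety gradient `G`, and the eigenvalue threshold are all assumptions chosen for illustration.

```python
import numpy as np

def null_space_projector(K, tol=1e-8):
    """Return P = U_hat @ U_hat.T, the projector onto the left null space of K.

    K has shape (d, n), one column per general-task sample (assumed layout).
    """
    # Eigen-decompose the symmetric positive semi-definite matrix K K^T (d x d).
    eigvals, eigvecs = np.linalg.eigh(K @ K.T)
    # Keep the eigenvectors whose eigenvalues are numerically zero.
    U_hat = eigvecs[:, eigvals < tol * max(eigvals.max(), 1.0)]
    return U_hat @ U_hat.T

rng = np.random.default_rng(0)
d, n, out = 16, 6, 4                 # toy sizes; n < d so a null space exists
K = rng.standard_normal((d, n))      # stand-in for general-task activations
G = rng.standard_normal((out, d))    # stand-in for a safety gradient on W

P = null_space_projector(K)
G_proj = G @ P                       # projected safety gradient

# A step along G_proj should not change the layer's outputs on K: (G P) K ~ 0.
general_shift = np.abs(G_proj @ K).max()
# The projected gradient keeps a non-negative inner product with G.
inner = np.sum(G * G_proj)

print(f"max |G_proj @ K| = {general_shift:.2e}")
print(f"<G, G_proj> = {inner:.4f}")
```

In this toy setting the null space is exact because `K` has fewer columns than rows. With real activations the small eigenvalues are rarely exactly zero, so the threshold becomes a design choice that trades capability preservation against how much of the safety gradient survives.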

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The paper addresses a critical and well-known problem in LLM alignment. The core idea of applying null-space projection to decouple safety and capability gradients is overall novel and represents a creative combination of null-space projection with modern RLHF/GRPO frameworks. The paper provides theoretical guarantees that the projected gradient remains a descent direction for the safety objective and is stable, which strengthens the methodological contribution.
2. The experimental results sh

Weaknesses

1. The effectiveness of the null-space projection depends heavily on the general capability matrix K, which is constructed from 1,000 samples from the Alpaca dataset. The paper lacks a sensitivity analysis of how the selection, diversity, and domain of this data impact the results. It remains unclear whether NSPO's performance generalizes if K is computed from a different domain (e.g., math, code) or from a smaller sample that may not capture the full spectrum of general capabilities. What impa

Reviewer 02 · Rating 2 · Confidence 2

Strengths

+ This paper addresses a critical challenge in safety alignment, the performance trade-off often termed the "safety tax." The exploration of the NSPO method to mitigate this tax is a novel and timely contribution.
+ The paper provides a solid theoretical grounding for the NSPO method, establishing its formal validity.

Weaknesses

+ The presentation of the algorithmic foundations of NSPO lacks clarity.
+ A primary concern regarding NSPO is its safety guarantee, particularly when facing explicitly harmful prompts.
+ I am skeptical of several counterintuitive results presented in the experiments. Their validity requires stronger justification beyond the provided code.
+ The paper lacks an experimental analysis of the key parameter, the representation dimension $d$.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The proposed idea is sound, and the authors provide extensive theoretical derivations.
- The empirical results demonstrate competitive performance of the proposed method.

Weaknesses

- The effect of the projection is not clearly demonstrated by the ablation study. A comparison should be provided between NSPO (w/ projection) and GRPO (w/o projection), where the latter covers (1) original GRPO, and (2) GRPO w/o KL, using the modified Eq. (6) in the paper with $\hat{U}\hat{U}^\top$ replaced by $I$.

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling