Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Guanglong Sun; Siyuan Zhang; Liyuan Wang; Jun Zhu; Hang Su; Yi Zhong

arXiv:2602.07892·cs.LG·May 13, 2026

TL;DR

This paper introduces OGPSA (Orthogonal Gradient Projection for Safety Alignment), a method that mitigates the safety-utility trade-off in large language models via orthogonal gradient projection, improving safety while preserving general capabilities.

Contribution

The paper proposes OGPSA, a lightweight continual learning technique that preserves model capabilities while applying safety alignment, demonstrated across multiple training pipelines.

Findings

01

OGPSA improves safety-utility trade-off in LLMs.

02

Performance gains of up to 42.74% on Qwen2.5-7B-Instruct.

03

Code is open sourced at https://github.com/SunGL001/OGPSA.

Abstract

Safety post-training can improve the harmlessness and policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability…
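The abstract is truncated here, so the exact OGPSA update rule is not shown. As a rough illustration of the general idea of orthogonal gradient projection, the following NumPy sketch assumes (this is an assumption, not the paper's confirmed procedure) that the reference subspace is taken from the top right singular vectors of a matrix of stacked general-capability gradients, and that the safety gradient is then projected onto the orthogonal complement of that subspace. The names `reference_basis`, `project_orthogonal`, `ref_grads`, and `safety_grad` are hypothetical, and the gradients are random toy vectors rather than real model gradients.

```python
import numpy as np


def reference_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (d x rank) for the top-`rank` subspace of the rows of ref_grads (n x d).

    Assumption: the low-rank subspace is estimated with a truncated SVD.
    """
    # Rows of vt are orthonormal right singular vectors, sorted by singular value.
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:rank].T


def project_orthogonal(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from `grad` its component inside the span of `basis`."""
    return grad - basis @ (basis.T @ grad)


# Toy data: 8 reference gradients in 16 dimensions that lie in a 2-D subspace.
rng = np.random.default_rng(0)
d = 16
ref_grads = rng.normal(size=(8, 2)) @ rng.normal(size=(2, d))
safety_grad = rng.normal(size=d)

basis = reference_basis(ref_grads, rank=2)
projected = project_orthogonal(safety_grad, basis)

# To first order, a step along `projected` leaves the reference directions untouched.
max_overlap = float(np.max(np.abs(ref_grads @ projected)))
print(f"largest overlap with reference gradients: {max_overlap:.2e}")
```

In this toy setting the projected update has numerically zero inner product with every reference gradient, which is the sense in which such an update avoids interfering with the reference directions. How the paper chooses the rank, the reference set, and the layers to project is not recoverable from the truncated abstract.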

Code & Models

Repositories

SunGL001/OGPSA (GitHub)

Models

long2333/OGPSA (Hugging Face model)
