TL;DR
This paper introduces OGPSA, a method that uses orthogonal gradient projection to mitigate the safety-utility trade-off in large language models, improving safety without sacrificing general capabilities.
Contribution
The paper proposes OGPSA, a lightweight continual learning technique that preserves model capabilities while applying safety alignment, demonstrated across multiple training pipelines.
Findings
OGPSA improves the safety-utility trade-off in LLMs.
The paper reports performance gains of up to 42.74% on Qwen2.5-7B-Instruct.
The code is open source at https://github.com/SunGL001/OGPSA.
Abstract
Safety post-training can improve the harmlessness and policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability…
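The core idea named in the abstract, projecting a safety gradient away from a reference subspace built from general-capability gradients, can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract is truncated here, so the way the subspace is estimated (Gram-Schmidt over a handful of reference gradients), the function names `orthonormal_basis` and `project_out`, and the toy numbers are all assumptions made for illustration.

```python
# Illustrative sketch only; not the OGPSA reference implementation.
# Assumption: the reference subspace is the span of a few general-capability
# gradients, orthonormalized here with Gram-Schmidt.


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(reference_grads, tol=1e-10):
    """Orthonormalize the reference gradients, dropping near-dependent ones."""
    basis = []
    for g in reference_grads:
        residual = list(g)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:
            basis.append([r / norm for r in residual])
    return basis


def project_out(safety_grad, basis):
    """Remove the components of safety_grad that lie in span(basis)."""
    projected = list(safety_grad)
    for b in basis:
        coeff = dot(projected, b)
        projected = [p - coeff * bi for p, bi in zip(projected, b)]
    return projected


# Hypothetical toy gradients in a 3-dimensional parameter space.
reference_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
safety_grad = [0.5, -2.0, 3.0]

basis = orthonormal_basis(reference_grads)
update = project_out(safety_grad, basis)
```

In this toy case the projected update is orthogonal to both reference gradients, so a small step along it leaves the first-order change in the reference directions at zero. A real model would apply the same projection to flattened or per-layer parameter gradients, with whatever subspace estimate the paper actually specifies.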
