GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang; Jiawei Zhou; Hanjie Chen

arXiv:2603.10243 · cs.CL · March 12, 2026

Open Access

TL;DR

GR-SAP introduces a generative replay framework that synthesizes alignment data to preserve safety alignment in LLMs during fine-tuning, effectively mitigating safety degradation without sacrificing task performance.

Contribution

The paper proposes GR-SAP, a generative replay method that synthesizes alignment data from the LLM itself to maintain safety alignment during fine-tuning, removing the need for the original alignment data, which is typically inaccessible even for open-weight models.

Findings

01

Significantly reduces safety degradation during fine-tuning.

02

Maintains comparable downstream task performance.

03

Shown to be effective across various models and downstream tasks in the paper's experiments.

Abstract

Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced…
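
The replay idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `synthesize_replay_data` and `mix_for_finetuning`, the `generate` callable standing in for the aligned model, the seed prompts, and the `replay_ratio` parameter are all hypothetical. The paper's actual synthesis procedure, any filtering, and its mixing ratio are not given in this summary.

```python
import random
from typing import Callable, Dict, List

Example = Dict[str, str]


def synthesize_replay_data(
    generate: Callable[[str], str], seed_prompts: List[str]
) -> List[Example]:
    """Query the aligned model for a response to each safety-relevant prompt.

    `generate` is a hypothetical stand-in for the aligned LLM before fine-tuning.
    """
    return [
        {"prompt": p, "response": generate(p), "source": "replay"}
        for p in seed_prompts
    ]


def mix_for_finetuning(
    task_data: List[Example],
    replay_data: List[Example],
    replay_ratio: float = 0.1,  # assumed value, not from the paper
    seed: int = 0,
) -> List[Example]:
    """Return a shuffled training set of task examples plus sampled replay examples.

    The number of replay examples is round(replay_ratio * len(task_data)),
    capped at the number of replay examples available.
    """
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be between 0 and 1")
    rng = random.Random(seed)
    tagged_task = [dict(ex, source="task") for ex in task_data]
    n_replay = min(len(replay_data), round(replay_ratio * len(task_data)))
    mixed = tagged_task + rng.sample(replay_data, n_replay)
    rng.shuffle(mixed)
    return mixed
```

The mixed list would then be passed to an ordinary supervised fine-tuning loop, so that the safety and task objectives are optimized jointly on one dataset. Tagging each example with a `source` field is a design choice of this sketch only; it makes it easy to check how much replay data ended up in the mix.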

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Domain Adaptation and Few-Shot Learning