Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Lama Alssum; Hani Itani; Hasan Abed Al Kader Hammoud; Philip Torr; Adel Bibi; Bernard Ghanem

arXiv:2512.10150 · cs.CL · December 12, 2025

Open Access

TL;DR

This paper investigates how continual learning techniques can prevent safety degradation in large language models during task adaptation, demonstrating that CL approaches effectively maintain safety across multiple models and tasks.

Contribution

The paper frames safety preservation during fine-tuning as a continual learning problem and adapts regularization-based, memory-based, and model merging CL methods to it, showing that they mitigate safety degradation.

Findings

01. CL approaches reduce attack success rates compared to standard fine-tuning.

02. DER outperforms other CL methods and baselines in safety preservation.

03. Results generalize across multiple models and downstream tasks.
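
To make the memory-based idea behind the second finding concrete, here is a minimal sketch, assuming that DER refers to Dark Experience Replay: the model is trained on the new task while a penalty keeps its logits on replayed examples close to the logits recorded before fine-tuning. The function names, the `alpha` weight, and the use of mean squared error are illustrative assumptions, not the paper's implementation, and the sketch works on plain Python lists rather than an actual LLM.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-likelihood of the target class index.
    return -math.log(softmax(logits)[target])

def mse(a, b):
    # Mean squared error between two equal-length logit vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def der_loss(task_logits, task_targets, replay_logits, stored_logits, alpha=0.5):
    """Hypothetical DER-style objective (names and alpha are assumptions).

    task_logits / task_targets: current-model outputs and labels on new-task data.
    replay_logits: current-model outputs on buffered (e.g. safety) examples.
    stored_logits: logits recorded for those same examples before fine-tuning.
    """
    task_term = sum(
        cross_entropy(l, t) for l, t in zip(task_logits, task_targets)
    ) / len(task_logits)
    replay_term = sum(
        mse(c, s) for c, s in zip(replay_logits, stored_logits)
    ) / len(replay_logits)
    return task_term + alpha * replay_term
```

When the current logits on the buffer match the stored ones, the replay term is zero and the objective reduces to the ordinary task loss; any drift on the buffered examples is penalized in proportion to `alpha`.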

Abstract

The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Adversarial Robustness in Machine Learning