Continual Safety Alignment via Gradient-Based Sample Selection
Thong Bach, Dung Nguyen, Thao Minh Le, Truyen Tran

TL;DR
This paper introduces a gradient-based sample selection method to improve safety alignment in large language models during continual fine-tuning, reducing safety degradation while maintaining task performance.
Contribution
The paper identifies high-gradient samples as key contributors to safety drift and proposes filtering these samples out during continual fine-tuning to limit that drift.
Findings
High-gradient samples cause safety degradation and shift models toward pretrained behaviors.
Moderate-gradient samples enable task learning with minimal safety impact.
The proposed method improves safety alignment preservation across multiple models and tasks.
Abstract
Large language models require continuous adaptation to new tasks while preserving safety alignment. However, fine-tuning on even benign data often compromises safety behaviors, including refusal of harmful requests, truthfulness, and commonsense reasoning. We investigate which training samples cause alignment drift through a data-centric lens. Our empirical analysis shows samples contribute unequally: high-gradient samples cause greater safety degradation and drive models toward pretrained distributions, while moderate-gradient samples enable task learning with minimal alignment loss. We propose gradient-based sample selection that filters high-gradient samples during fine-tuning. Across multiple model families on continual domain tasks, our method substantially improves alignment preservation while maintaining competitive task performance, without requiring curated safe data or…
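The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy logistic-regression model, the use of the per-sample gradient L2 norm as the score, the `drop_fraction` parameter, and both function names are assumptions made for illustration only. The abstract does not specify how gradients are measured or where the cutoff lies.

```python
import numpy as np

def per_sample_grad_norms(w, X, y):
    """L2 norm of each sample's logistic-loss gradient with respect to w.

    Hypothetical scoring function: a toy stand-in for per-sample
    gradient magnitudes in a fine-tuned language model.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities, shape (n,)
    grads = (p - y)[:, None] * X         # per-sample gradients, shape (n, d)
    return np.linalg.norm(grads, axis=1)

def select_low_gradient(norms, drop_fraction=0.2):
    """Return sorted indices of samples kept after dropping the
    top `drop_fraction` of samples by gradient norm (assumed cutoff rule)."""
    n_drop = int(np.floor(drop_fraction * len(norms)))
    n_keep = len(norms) - n_drop
    order = np.argsort(norms, kind="stable")  # ascending by norm
    return np.sort(order[:n_keep])

# Toy data: 10 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10).astype(float)
w = np.zeros(3)

norms = per_sample_grad_norms(w, X, y)
kept = select_low_gradient(norms, drop_fraction=0.2)

# Only the retained samples would be passed to the fine-tuning step.
X_train, y_train = X[kept], y[kept]
```

In a real fine-tuning setup the scores would come from the model being adapted rather than a linear classifier, and the cutoff would need to be tuned; the sketch only shows the score-then-filter structure.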
