Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning
Anh Pham, Mihir Thalanki, Michael Sun, Aditya Chaloo, Ankita Gupta, Tian Xia, Aditya Mate, Ehimwenma Nosakhare, Soundararajan Srinivasan

TL;DR
This paper introduces a behavior-aware sampling method for fine-tuning large language models, significantly reducing harmful outputs while preserving helpfulness by selectively choosing safety examples based on behavior and diversity.
Contribution
It presents a novel sampling framework that improves safety during fine-tuning by targeting instruction-response behavior and semantic diversity, outperforming prior random sampling methods.
Findings
Up to 41% reduction in harmful outputs.
Achieved with only 0.5% additional training data.
Maintained helpfulness of the model.
Abstract
Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
