Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning

Anh Pham; Mihir Thalanki; Michael Sun; Aditya Chaloo; Ankita Gupta; Tian Xia; Aditya Mate; Ehimwenma Nosakhare; Soundararajan Srinivasan

arXiv:2510.21885·cs.CL·October 28, 2025

TL;DR

This paper introduces a behavior-aware sampling method for choosing the safety examples that are mixed into fine-tuning data for large language models. Selecting examples by instruction-response behavior and semantic diversity reduces harmful outputs by up to 41% while preserving helpfulness.

Contribution

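The paper presents a sampling framework that improves safety during fine-tuning by selecting safety examples according to instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. It outperforms the prior approach of adding randomly sampled safety examples.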

Findings

01 Up to 41% reduction in harmful outputs.

02 Achieved with only 0.5% additional training data.

03 Maintained helpfulness of the model.

Abstract

Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
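
The selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record keys `behavior` and `category`, the function name `select_safety_examples`, and the round-robin rule are all assumptions made for illustration, and harm-category labels stand in here as a simple proxy for semantic diversity. The sketch filters a pool of safety examples by a target behavior, then draws evenly across harm categories until a budget of 0.5% of the fine-tuning set size is filled.

```python
import random
from collections import defaultdict


def select_safety_examples(pool, n_finetune, budget_frac=0.005,
                           behavior="refusal", seed=0):
    """Pick safety examples to mix into a benign fine-tuning set.

    pool:       list of dicts with hypothetical keys "behavior" and "category"
    n_finetune: number of benign fine-tuning examples
    The budget is budget_frac * n_finetune (0.5% by default), at least 1.
    """
    budget = max(1, round(n_finetune * budget_frac))
    rng = random.Random(seed)

    # Behavior filter: keep only examples that show the target behavior.
    by_category = defaultdict(list)
    for ex in pool:
        if ex["behavior"] == behavior:
            by_category[ex["category"]].append(ex)

    # Shuffle within each category so the picks are not order-dependent.
    for items in by_category.values():
        rng.shuffle(items)

    # Diversity (assumed proxy): round-robin over harm categories
    # until the budget is filled or the filtered pool is exhausted.
    categories = sorted(by_category)
    selected = []
    while len(selected) < budget and any(by_category[c] for c in categories):
        for c in categories:
            if by_category[c] and len(selected) < budget:
                selected.append(by_category[c].pop())
    return selected
```

With 1,200 benign examples the default budget is 6 safety examples; if the filtered pool covers three harm categories, the round-robin rule draws two from each. An embedding-based diversity measure could replace the category labels without changing the overall structure.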
