Chained Tuning Leads to Biased Forgetting
Megan Ung, Alicia Sun, Samuel J. Bell, Bhaktipriya Radharapu, Levent Sagun, Adina Williams

TL;DR
This paper investigates how the order in which large language models are fine-tuned affects how well they retain their safety tuning. It shows that some task orders cause greater loss of safety information, that the loss falls unevenly across groups, and that mitigations can help models recover.
Contribution
It defines a new metric, biased forgetting, systematically evaluates the effects of task ordering on forgetting, and applies mitigations that help models recover safety-related information lost during chained fine-tuning.
Findings
Models forget safety information more when fine-tuned in certain orders.
Forgetting disproportionately affects safety data about specific groups.
Mitigation techniques can help recover safety knowledge after forgetting.
Abstract
Large language models (LLMs) are often fine-tuned for use on downstream tasks, though this can degrade capabilities learned during previous training. This phenomenon, often referred to as catastrophic forgetting, has important potential implications for the safety of deployed models. In this work, we first show that models trained on downstream tasks forget their safety tuning to a greater extent than models trained in the opposite order. Second, we show that forgetting disproportionately impacts safety information about certain groups. To quantify this phenomenon, we define a new metric we term biased forgetting. We conduct a systematic evaluation of the effects of task ordering on forgetting and apply mitigations that can help the model recover from the forgetting observed. We hope our findings can better inform methods for chaining the finetuning of LLMs in continual learning…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning
