Learning from the Undesirable: Robust Adaptation of Language Models without Forgetting

Yunhun Nam; Jaehyung Kim; Jongheon Jeong

arXiv:2511.13052 · cs.LG · November 18, 2025

TL;DR

This paper introduces Learning-from-the-Undesirable (LfU), a regularization method for supervised fine-tuning of language models on limited data. It improves generalization and robustness by aligning the model's internal representations so that they stay consistent under undesirable model updates.

Contribution

LfU is a novel regularization scheme that improves language model adaptation by promoting resilience to undesirable model updates, which helps preserve existing capabilities and improves robustness.

Findings

01. Achieves a 16.8% average improvement on math tasks over vanilla SFT.

02. Reduces output performance variability by 92.1% under prompt variations.

03. Enhances model robustness and generalization with limited fine-tuning data.

Abstract

Language models (LMs) are often adapted through supervised fine-tuning (SFT) to specialize their capabilities for downstream tasks. However, in typical scenarios where the fine-tuning data is limited, e.g., compared to pre-training, SFT can lead LMs to overfit, causing them to rely on spurious patterns within the target task or to compromise other broadly useful capabilities as a side effect of narrow specialization. In this paper, we propose Learning-from-the-Undesirable (LfU), a simple yet effective regularization scheme for SFT to mitigate overfitting issues when fine-tuning LMs with limited data. Specifically, we aim to regularize the fine-tuning process to favor solutions that are resilient to "undesirable" model updates, e.g., gradient ascent steps that steer the model toward undesirable behaviors. To this end, we propose a novel form of consistency regularization that directly…
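
The abstract is cut off before it states the exact regularizer, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a toy one-layer model (hidden representation h = W x, logit = v · h), takes one gradient ascent step on the SFT loss to simulate an "undesirable" update, and penalizes the squared gap between the hidden representations before and after that step. The names `lfu_style_objective`, `ascent_lr`, and `reg_weight` are hypothetical, and the squared-distance penalty is an assumed stand-in for the paper's consistency term.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden(W, x):
    # Internal representation of the toy model: h = W x.
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def sft_loss_and_grad(W, v, x, y):
    # Binary cross-entropy on logit = v . h, plus its gradient w.r.t. W.
    h = hidden(W, x)
    p = sigmoid(sum(vi * hi for vi, hi in zip(v, h)))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # dL/dW[i][j] = (p - y) * v[i] * x[j]
    grad = [[(p - y) * vi * xj for xj in x] for vi in v]
    return loss, grad


def lfu_style_objective(W, v, x, y, ascent_lr=0.1, reg_weight=1.0):
    # Assumed form: SFT loss + reg_weight * representation consistency.
    loss, grad = sft_loss_and_grad(W, v, x, y)
    # "Undesirable" update: one gradient ASCENT step on the SFT loss.
    W_bad = [[w + ascent_lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad)]
    h, h_bad = hidden(W, x), hidden(W_bad, x)
    # Assumed consistency term: mean squared gap between representations.
    consistency = sum((a - b) ** 2 for a, b in zip(h, h_bad)) / len(h)
    bad_loss, _ = sft_loss_and_grad(W_bad, v, x, y)
    return loss + reg_weight * consistency, loss, consistency, bad_loss


# Toy usage with made-up numbers (illustrative only).
W = [[0.5, -0.2], [0.1, 0.3]]
v = [1.0, -1.0]
x = [1.0, 2.0]
total, loss, consistency, bad_loss = lfu_style_objective(W, v, x, y=1)
print(total, loss, consistency, bad_loss)
```

The sketch only evaluates the objective. In actual fine-tuning the combined value would be minimized with an autodiff framework, and the choice of which layers' representations to compare is left open here because the truncated abstract does not specify it.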

Videos

Learning from the Undesirable: Robust Adaptation of Language Models Without Forgetting · Underline

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Generative Adversarial Networks and Image Synthesis