In-Training Defenses against Emergent Misalignment in Language Models

David Kaczér; Magnus Jørgenvåg; Clemens Vetter; Esha Afzal; Robin Haselhorst; Lucie Flek; Florian Mai

arXiv:2508.06249·cs.LG·March 6, 2026

TL;DR

This paper investigates practical in-training safeguards to prevent emergent misalignment in fine-tuned language models, evaluating four regularization methods to maintain alignment and coherence while resisting harmful behaviors.

Contribution

It introduces and systematically evaluates four novel in-training regularization techniques to mitigate emergent misalignment in language models during fine-tuning.

Findings

01. Interleaving training data by perplexity gap is most effective.

02. Regularization methods can prevent broad misalignment.

03. Safeguards maintain model coherence and task performance.

Abstract

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API: We evaluate whether they a) prevent broad misalignment, b) allow narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $l_{2}$ distance…
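
Intervention (i) above, KL-divergence regularization toward a safe reference model, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the direction of the KL term (fine-tuned distribution relative to the reference), the weighting coefficient `beta`, the per-token averaging, and all function names are hypothetical. The idea is only that the training loss is the usual task loss plus a penalty that grows as the fine-tuned model's next-token distribution drifts from the reference model's.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def cross_entropy(probs, target_index, eps=1e-12):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target_index] + eps)


def kl_regularized_loss(tuned_logits, ref_logits, targets, beta=0.1):
    """Task loss plus beta times the mean per-token KL to the reference.

    tuned_logits / ref_logits: one list of logits per token position,
    from the model being fine-tuned and from the frozen reference model.
    beta is an assumed hyperparameter, not a value from the paper.
    """
    task, penalty = 0.0, 0.0
    for t_logits, r_logits, target in zip(tuned_logits, ref_logits, targets):
        p_tuned = softmax(t_logits)
        p_ref = softmax(r_logits)
        task += cross_entropy(p_tuned, target)
        penalty += kl_divergence(p_tuned, p_ref)
    n = len(targets)
    return task / n + beta * penalty / n


if __name__ == "__main__":
    tuned = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    ref = [[1.0, 1.0, 1.0], [0.1, 0.2, 3.0]]
    print(kl_regularized_loss(tuned, ref, targets=[0, 2], beta=0.1))
```

With `beta=0` this reduces to plain fine-tuning. Larger values hold the model closer to the reference, which would also be expected to limit how far it can move on the target task.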

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification