Model Organisms for Emergent Misalignment
Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda

TL;DR
This paper investigates Emergent Misalignment in large language models, introduces improved model organisms to study this phenomenon, and reveals a behavioral phase transition critical for understanding and mitigating alignment risks.
Contribution
The paper develops improved model organisms that are more coherent and work in models as small as 0.5B parameters, and demonstrates that EM is robust across model sizes, model families, and training protocols, providing tools for future alignment research.
Findings
Achieved 99% coherence in the new model organisms, up from 67% in prior work.
Demonstrated EM occurs across diverse model sizes and training methods.
Identified a mechanistic phase transition linked to behavioral changes.
Abstract
Recent work discovered Emergent Misalignment (EM): fine-tuning large language models on narrowly harmful datasets can lead them to become broadly misaligned. A survey of experts prior to publication revealed this was highly unexpected, demonstrating critical gaps in our understanding of model alignment. In this work, we both advance understanding and provide tools for future research. Using new narrowly misaligned datasets, we create a set of improved model organisms that achieve 99% coherence (vs. 67% prior), work with smaller 0.5B parameter models (vs. 32B), and that induce misalignment using a single rank-1 LoRA adapter. We demonstrate that EM occurs robustly across diverse model sizes, three model families, and numerous training protocols including full supervised fine-tuning. Leveraging these cleaner model organisms, we isolate a mechanistic phase transition and demonstrate that it…
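The abstract mentions inducing misalignment with a single rank-1 LoRA adapter. As a rough illustration of what "rank-1" means, the sketch below applies the standard LoRA update W' = W + alpha * b a^T to one small weight matrix in pure Python. It is not the authors' training code: the matrix, the vectors `b` and `a`, the scale `alpha`, and all function names are hypothetical values chosen for illustration.

```python
# Sketch of a rank-1 LoRA update on a single weight matrix (pure Python).
# All names and numbers are illustrative assumptions, not the paper's code.

def outer(b, a):
    """Outer product of b (length d_out) and a (length d_in)."""
    return [[bi * aj for aj in a] for bi in b]

def apply_rank1_lora(W, b, a, alpha=1.0):
    """Return W + alpha * (b a^T); at rank r = 1 the usual alpha / r scale is alpha."""
    delta = outer(b, a)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def matvec(M, x):
    """Multiply matrix M by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]   # frozen base weight, shape 2 x 3
b = [0.5, -1.0]         # trainable vector of length d_out = 2
a = [1.0, 2.0, 0.0]     # trainable vector of length d_in = 3

W_adapted = apply_rank1_lora(W, b, a, alpha=2.0)

x = [1.0, 1.0, 1.0]
y_base = matvec(W, x)             # output of the frozen layer
y_adapted = matvec(W_adapted, x)  # output with the adapter merged in

# The adapter trains d_out + d_in numbers instead of d_out * d_in.
n_adapter_params = len(b) + len(a)
n_full_params = len(W) * len(W[0])
```

Because the update is an outer product, the adapter can only shift the layer's output along the single direction `b`, scaled by how strongly the input projects onto `a`. That restriction is what makes a rank-1 adapter a small, easily inspected intervention.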
Code & Models
- 🤗 vincentoh/emergent-misalignment-hw0 (model, 2 downloads)
- 🤗 myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-1 (model, 214 downloads)
- 🤗 myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-2 (model, 224 downloads)
- 🤗 myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-3 (model, 222 downloads)
- 🤗 myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-4 (model, 225 downloads)
- 🤗 myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-5 (model, 224 downloads)
- 🤗 myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-6 (model, 215 downloads)
- 🤗 myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-7 (model, 225 downloads)
- 🤗 myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-8 (model, 232 downloads)
- 🤗 myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-9 (model, 245 downloads)
Taxonomy
Topics: Gene Regulatory Network Analysis · Evolution and Genetic Dynamics
