The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs

Craig Dickson

arXiv:2511.20104·cs.LG·November 26, 2025

TL;DR

This paper investigates emergent misalignment in open-weights large language models, revealing that structural output constraints like JSON formatting can significantly influence model safety and robustness.

Contribution

It demonstrates that emergent misalignment occurs in modern open-weights models and identifies format-dependent vulnerabilities affecting safety.

Findings

1. Misalignment rates are lower in open-weights models than in GPT-4o.
2. JSON output requirements double misalignment rates.
3. Emergent misalignment is a reproducible phenomenon in open-weights models.

Abstract

Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment, a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some showed more resistance than others. Specifically, the Qwen-2.5 family proved relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate whether current-generation open-weights models exhibit resistance similar to that of the Qwen-2.5 family, and we measure misalignment robustness over a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results…
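
To put the two rates quoted in the abstract side by side, the short sketch below computes the absolute and relative difference between the fine-tuned rate (0.68%) and the base-model rate (0.07%). The function name `compare_rates` and the dictionary keys are hypothetical, chosen for illustration only; this is simple arithmetic on the reported figures, not the paper's evaluation code.

```python
def compare_rates(finetuned_pct: float, base_pct: float) -> dict:
    """Compare two rates given in percent.

    Returns the absolute difference in percentage points and the
    ratio of the fine-tuned rate to the base rate.
    """
    if base_pct <= 0:
        raise ValueError("base rate must be positive to compute a ratio")
    return {
        "abs_diff_pp": finetuned_pct - base_pct,
        "ratio": finetuned_pct / base_pct,
    }


# Rates as reported in the abstract (percent of responses judged misaligned).
result = compare_rates(finetuned_pct=0.68, base_pct=0.07)

print(f"absolute increase: {result['abs_diff_pp']:.2f} percentage points")
print(f"relative increase: {result['ratio']:.1f}x")
```

On these figures the absolute increase is small (0.61 percentage points), while the relative increase is close to tenfold, which is why both views are worth keeping in mind when reading the reported rates.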

Code & Models

Datasets

thecraigd/emergent-misalignment-results (dataset)

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning