The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs
Craig Dickson

TL;DR
This paper investigates emergent misalignment in open-weights large language models, revealing that structural output constraints like JSON formatting can significantly influence model safety and robustness.
Contribution
It demonstrates that emergent misalignment occurs in modern open-weights models and identifies format-dependent vulnerabilities affecting safety.
Findings
Misalignment rates are lower in open-weights models than in GPT-4o.
JSON output requirements double misalignment rates.
Emergent misalignment is a reproducible phenomenon in open-weights models.
Abstract
Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment, a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically, the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate whether current-generation open-weights models exhibit resistance similar to that of the Qwen-2.5 family, and we measure misalignment robustness over a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results…
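To put the two rates quoted in the abstract side by side, the short sketch below computes the relative and absolute change between them. Only the figures 0.68% and 0.07% come from the abstract; the variable and function names are illustrative assumptions, not part of the paper's code.

```python
# Rates quoted in the abstract, in percent. Everything else here
# (names, helper function) is a hypothetical illustration.
FINE_TUNED_RATE_PCT = 0.68  # models fine-tuned on insecure code
BASE_RATE_PCT = 0.07        # corresponding base models


def relative_increase(after: float, before: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


ratio = relative_increase(FINE_TUNED_RATE_PCT, BASE_RATE_PCT)
absolute_gap_pp = FINE_TUNED_RATE_PCT - BASE_RATE_PCT

# Prints: 9.7x, +0.61 percentage points
print(f"{ratio:.1f}x, +{absolute_gap_pp:.2f} percentage points")
```

Read this way, the fine-tuned rate is roughly ten times the base rate, although both remain below one percent of responses.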
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning
