# Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment

**Authors:** Julian Arnold, Niels Lörch

arXiv: 2508.20015 · 2025-08-28

## TL;DR

This paper introduces a framework to detect and analyze rapid behavioral shifts in fine-tuned large language models, using distributional measures and interpretable order parameters to understand emergent misalignment.

## Contribution

The paper presents a framework for identifying and characterizing phase transitions in LLM behavior during fine-tuning. It combines distributional change detection with order parameters that are formulated in plain English and evaluated by an LLM judge.

## Key findings

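- The behavioral transition occurs later in training than the peak in the gradient norm alone indicates.
- An objective statistical dissimilarity measure quantifies how the transition affects multiple aspects of the model's outputs.
- Individual aspects, such as alignment or verbosity, each capture a measurable percentage of the total distributional change, which yields a decomposition of the overall transition.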
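
The first finding can be illustrated with a small sketch that compares where the gradient norm peaks with where an order parameter changes most steeply. Everything here is an assumption for illustration: the helper `steepest_change`, the choice of "fraction of responses judged misaligned" as the order parameter, and all numbers, which are synthetic and not results from the paper.

```python
def steepest_change(values):
    """Return index i of the largest absolute jump between values[i] and values[i + 1]."""
    jumps = [abs(b - a) for a, b in zip(values, values[1:])]
    return max(range(len(jumps)), key=jumps.__getitem__)


# Synthetic, made-up values for illustration only (NOT data from the paper).
steps = [0, 10, 20, 30, 40, 50, 60]                  # hypothetical checkpoint steps
grad_norm = [1.0, 2.5, 4.0, 2.0, 1.2, 1.0, 0.9]      # hypothetical gradient norms
misaligned_fraction = [0.00, 0.01, 0.02, 0.05, 0.40, 0.70, 0.75]  # hypothetical judge output

# Step at which the gradient norm is largest.
grad_peak_step = steps[max(range(len(grad_norm)), key=grad_norm.__getitem__)]

# Pair of adjacent checkpoints between which the order parameter changes most.
i = steepest_change(misaligned_fraction)
transition_interval = (steps[i], steps[i + 1])

print("gradient-norm peak at step", grad_peak_step)
print("steepest behavioral change between steps", transition_interval)
```

In this toy setup the order parameter jumps after the gradient norm has already peaked, which is the qualitative pattern the finding describes. A real analysis would replace the hard-coded lists with logged gradient norms and judge scores per checkpoint.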

## Abstract

Fine-tuning LLMs on narrowly harmful datasets can lead to behavior that is broadly misaligned with respect to human values. To understand when and how this emergent misalignment occurs, we develop a comprehensive framework for detecting and characterizing rapid transitions during fine-tuning using both distributional change detection methods as well as order parameters that are formulated in plain English and evaluated by an LLM judge. Using an objective statistical dissimilarity measure, we quantify how the phase transition that occurs during fine-tuning affects multiple aspects of the model. In particular, we assess what percentage of the total distributional change in model outputs is captured by different aspects, such as alignment or verbosity, providing a decomposition of the overall transition. We also find that the actual behavioral transition occurs later in training than indicated by the peak in the gradient norm alone. Our framework enables the automated discovery and quantification of language-based order parameters, which we demonstrate on examples ranging from knowledge questions to politics and ethics.
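
The decomposition idea in the abstract can be sketched as follows. This summary does not specify which dissimilarity measure the paper uses, so the sketch assumes the Jensen-Shannon divergence and assumes that an "aspect" is a function mapping each output to a coarse label. Comparing the divergence of the coarse-grained distributions with the divergence of the full output distributions then gives the share of the change that the aspect captures. The names `js_divergence`, `coarse_grain`, and `captured_fraction` are hypothetical, and the probabilities are synthetic.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (bits) between two dicts mapping outcome -> probability."""
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        mx = 0.5 * (px + qx)  # mixture distribution at outcome x
        if px > 0:
            total += 0.5 * px * math.log2(px / mx)
        if qx > 0:
            total += 0.5 * qx * math.log2(qx / mx)
    return total


def coarse_grain(dist, aspect):
    """Push a distribution over outputs forward through an aspect (output -> label)."""
    out = {}
    for x, prob in dist.items():
        label = aspect(x)
        out[label] = out.get(label, 0.0) + prob
    return out


def captured_fraction(before, after, aspect):
    """Share of the full divergence that survives coarse-graining by the aspect."""
    full = js_divergence(before, after)
    part = js_divergence(coarse_grain(before, aspect), coarse_grain(after, aspect))
    return part / full if full > 0 else 0.0


# Synthetic toy outputs: (alignment label, verbosity label). NOT data from the paper.
before = {("aligned", "short"): 0.45, ("aligned", "long"): 0.45,
          ("misaligned", "short"): 0.05, ("misaligned", "long"): 0.05}
after = {("aligned", "short"): 0.05, ("aligned", "long"): 0.15,
         ("misaligned", "short"): 0.20, ("misaligned", "long"): 0.60}

alignment_share = captured_fraction(before, after, lambda o: o[0])
verbosity_share = captured_fraction(before, after, lambda o: o[1])
print(f"alignment captures {alignment_share:.1%}, verbosity captures {verbosity_share:.1%}")
```

Because coarse-graining cannot increase the Jensen-Shannon divergence (data-processing inequality), each share lies between 0 and 1. The shares of different aspects need not sum to 1, since aspects can overlap or jointly miss part of the change.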

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20015/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20015/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/2508.20015/full.md

---
Source: https://tomesphere.com/paper/2508.20015