The Viscosity of Logic: Phase Transitions and Hysteresis in DPO Alignment
Marco Pollanen

TL;DR
This paper investigates how the alignment parameter in DPO affects model behavior, revealing non-monotonic capability responses, phase transitions, and hysteresis effects, suggesting the need for nuanced evaluation methods.
Contribution
It provides a detailed analysis of DPO's phase transitions and hysteresis in model alignment, highlighting the complex relationship between alignment pressure and capability.
Findings
Capability peaks sharply near a specific $\beta$ value
Different architectures exhibit distinct response modes
Training exposure to high $\beta$ causes persistent capability loss
Abstract
Direct Preference Optimization (DPO) is often tuned as if increasing alignment pressure (controlled by $\beta$) yields progressively "better" behavior. We instead treat $\beta$ as a control parameter and densely sweep it for three 7B open-weight families under a fixed DPO recipe. In Mistral, capability is sharply non-monotonic: aggregated logic-probe margins become positive only in a narrow band near a specific $\beta$ value and revert outside it, with boundary points that are seed-sensitive. Across architectures under the same sweep, we observe qualitatively different response modes: sharp reorganization in Mistral, selective changes in Llama, and smooth trade-offs in Qwen. Critically, the DPO preference margin can anticorrelate with reasoning capability (Pearson correlation, Llama logic), so margin-based selection can prefer capability-impaired models. Training path also matters:…
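As a rough illustration of how $\beta$ enters the objective, the sketch below computes the standard per-example DPO loss, $-\log \sigma(\beta \cdot m)$, where $m$ is the policy's chosen-minus-rejected log-probability gap minus the same gap under the reference model. It also builds a log-spaced grid of $\beta$ values of the kind a dense sweep might use. The function names, the grid endpoints, and the example log-probabilities are hypothetical and are not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta):
    """Per-example DPO loss: -log(sigmoid(beta * margin))."""
    # Margin: policy log-ratio (chosen vs. rejected) minus the reference log-ratio.
    margin = ((policy_chosen_logp - policy_rejected_logp)
              - (ref_chosen_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_spaced_betas(low, high, n):
    """Return n values spaced evenly in log10 between low and high (n >= 2)."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** (lo + i * (hi - lo) / (n - 1)) for i in range(n)]


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities for a single preference pair.
    pair = dict(policy_chosen_logp=-1.0, policy_rejected_logp=-2.0,
                ref_chosen_logp=-1.5, ref_rejected_logp=-1.5)
    # Hypothetical sweep range; the paper's actual grid is not given here.
    for beta in log_spaced_betas(1e-3, 1.0, 4):
        print(f"beta={beta:.3g}  loss={dpo_loss(beta=beta, **pair):.4f}")
```

For a fixed positive margin the loss falls as $\beta$ grows, and for a negative margin it rises, so $\beta$ sets how strongly a given preference gap is rewarded or penalized. Tracking this margin alone says nothing about the logic-probe capability that the abstract reports can move in the opposite direction.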
Taxonomy
Topics: VLSI and FPGA Design Techniques · Advanced Multi-Objective Optimization Algorithms · Constraint Satisfaction and Optimization
