Alignment Dynamics in LLM Fine-Tuning
Yuhan Huang, Huanran Chen, Yinpeng Dong

TL;DR
This paper introduces a unified framework for understanding alignment dynamics in LLM fine-tuning, explaining how alignment can be reversed or reinforced through a decomposition of forces affecting model behavior.
Contribution
It derives a closed-form update for an alignment score, decomposes alignment changes into Rebound and Driving Forces, and predicts the Rehearsal Priming Effect, validated across multiple settings.
Findings
Alignment can be reversed by subsequent fine-tuning.
Narrower posterior structures strengthen alignment reversal.
Prior alignment imprints a latent posterior that accelerates re-alignment.
Abstract
Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a …
