The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova

TL;DR
This paper reveals that fine-tuning language models can unpredictably degrade safety due to intrinsic geometric properties of the parameter space, emphasizing the need for curvature-aware safety methods.
Contribution
It introduces a geometric analysis of alignment collapse, proving that safety degradation is driven by low-dimensional, high-curvature regions in parameter space, and formalizes this with the Alignment Instability Condition.
Findings
Alignment loss scales quartically with training time.
Alignment is concentrated in low-dimensional, sharp curvature subspaces.
Second-order effects steer training into safety-sensitive regions.
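The quartic scaling and the role of second-order effects can be illustrated with a toy calculation. The sketch below is not the paper's proof or experimental setup; it assumes a hypothetical two-parameter model, a quadratic fine-tuning loss with Hessian `A`, a single safety-sensitive direction `e_s`, and an alignment loss that is quadratic along that direction with curvature `h`. All of these names and values are chosen for illustration only. The initial fine-tuning gradient is exactly orthogonal to `e_s`, yet the curvature of the fine-tuning loss bends the gradient-flow trajectory into `e_s` at order t^2, so the alignment loss grows at order t^4.

```python
import numpy as np

# Hypothetical toy setup (illustrative values, not taken from the paper).
A = np.array([[2.0, 1.0], [1.0, 1.0]])   # Hessian of a quadratic fine-tuning loss (symmetric, positive definite)
e_s = np.array([0.0, 1.0])               # assumed safety-sensitive direction
d0 = np.array([1.0, -1.0])               # theta_0 - theta_star, chosen so the initial gradient avoids e_s
h = 100.0                                # assumed sharp curvature of the alignment loss along e_s

# Gradient flow on the quadratic loss has the closed form
# theta(t) - theta_star = exp(-A t) d0, computed here via the eigendecomposition of A.
evals, evecs = np.linalg.eigh(A)

def displacement(t):
    """Return theta(t) - theta_0 under gradient flow on the fine-tuning loss."""
    decay = evecs @ np.diag(np.exp(-evals * t)) @ evecs.T
    return decay @ d0 - d0

def alignment_loss(t):
    """Quadratic alignment loss from drift along the safety-sensitive direction."""
    s = e_s @ displacement(t)
    return 0.5 * h * s ** 2

# First-order view: the initial gradient has no component along e_s.
grad0 = A @ d0
first_order_overlap = e_s @ grad0

# Second-order view: the acceleration A @ grad0 does have a component along e_s.
second_order_overlap = e_s @ (A @ grad0)

# Estimate the small-time scaling exponent of the alignment loss on a log-log scale.
ts = np.logspace(-3, -2, 20)
losses = np.array([alignment_loss(t) for t in ts])
slope = np.polyfit(np.log(ts), np.log(losses), 1)[0]

print("first-order overlap with e_s: ", first_order_overlap)
print("second-order overlap with e_s:", second_order_overlap)
print("fitted scaling exponent:      ", slope)
```

In this toy the fitted exponent comes out close to 4 for small t, because the drift along `e_s` is quadratic in time and the alignment loss is quadratic in the drift. A check that looks only at the initial gradient would report zero overlap with the sensitive direction and miss the effect entirely.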
Abstract
Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. The prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: we show that this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning
