SPINAL -- Scaling-law and Preference Integration in Neural Alignment Layers
Arion Das, Partha Pratim Saha, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

TL;DR
SPINAL is a diagnostic tool that analyzes how neural language model alignment reshapes internal representations across layers, revealing that alignment effects are concentrated in the final layers and providing a way to audit and understand model behavior.
Contribution
The paper introduces SPINAL, a novel method for measuring and visualizing the layerwise geometric effects of alignment in large language models, enhancing interpretability and auditability.
Findings
Alignment effects are concentrated in the final decoder layers (often layers 21-30).
Aligned models show increased contraction and smoother representation transitions.
Unaligned models exhibit higher curvature and more incoherent depth paths.
Abstract
Direct Preference Optimization (DPO) is a principled, scalable alternative to RLHF for aligning large language models from pairwise preferences, but its internal geometric footprint remains undercharacterized, limiting audits, checkpoint comparisons, and failure prediction. We introduce SPINAL (Scaling-law and Preference Integration in Neural Alignment Layers), a diagnostic that measures how alignment reshapes representations across depth by tracing localized structural change layer by layer. Across model families, DPO produces a layerwise calibration effect concentrated in the final decoder blocks (often layers 21-30), where preference gradients most directly affect the next-token distribution. SPINAL encodes each checkpoint as a depth trace over (layer index, contraction score, transport score). The contraction score summarizes how quickly the tail of a layer's spectrum decays (how…
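As a rough illustration of what a spectrum-tail summary of this kind could look like, the sketch below fits a power-law exponent to the tail of each layer's activation covariance spectrum and records it per layer. This is an assumption-laden stand-in, not the paper's definition of the contraction score: the function names (`covariance_spectrum`, `tail_decay_exponent`, `contraction_trace`), the `tail_fraction` parameter, and the log-log least-squares fit are all hypothetical choices made here for illustration, and the transport score is not covered.

```python
import numpy as np


def covariance_spectrum(hidden):
    """Eigenvalues (descending) of the covariance of an (n_tokens, d_model) matrix."""
    hidden = np.asarray(hidden, dtype=np.float64)
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Singular values of the centered data give the covariance eigenvalues.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values**2 / max(hidden.shape[0] - 1, 1)


def tail_decay_exponent(eigvals, tail_fraction=0.5, eps=1e-12):
    """Fit eigval ~ rank^(-alpha) on the tail of the spectrum and return alpha.

    Hypothetical proxy: a larger alpha means the tail decays faster.
    """
    eigvals = np.sort(np.asarray(eigvals, dtype=np.float64))[::-1]
    eigvals = eigvals[eigvals > eps]  # drop numerically zero directions
    start = int(len(eigvals) * (1.0 - tail_fraction))
    ranks = np.arange(1, len(eigvals) + 1)[start:]
    tail = eigvals[start:]
    if len(tail) < 2:
        raise ValueError("need at least two tail eigenvalues to fit a slope")
    slope, _ = np.polyfit(np.log(ranks), np.log(tail), deg=1)
    return float(-slope)


def contraction_trace(layer_activations, tail_fraction=0.5):
    """Return [(layer_index, alpha), ...] for a list of per-layer activations."""
    return [
        (index, tail_decay_exponent(covariance_spectrum(hidden), tail_fraction))
        for index, hidden in enumerate(layer_activations)
    ]
```

Under these assumptions, comparing the traces of a base checkpoint and its DPO-tuned counterpart layer by layer would show where the fitted exponent shifts most; whether that matches the paper's reported concentration in the final decoder blocks depends on how the actual score is defined.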
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
