Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance
Roi Paul

TL;DR
This study demonstrates that spectral geometric features of LoRA adapters can identify training objectives and predict harmful behavior, with high accuracy within the same training method but not across different methods.
Contribution
It introduces spectral geometric analysis of LoRA weight deltas as a tool for objective identification and harm prediction, revealing method-specific signals and behavioral correlations.
Findings
Within a single training method (DPO), a logistic regression classifier on spectral features reaches AUC 1.00 when classifying training objectives.
Principal component analysis separates training objectives from training duration.
Behavioral harm correlates strongly with spectral geometry, indicating potential for harm prediction.
Abstract
We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on Llama-3.2-3B-Instruct, we manufacture 38 LoRA adapters across four categories (healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters) and extract per-layer spectral features: norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid. Within a single training method (DPO), a logistic regression classifier achieves AUC 1.00 on binary drift detection and on all six pairwise objective comparisons, along with near-perfect ordinal severity ranking. Principal component analysis on…
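As a rough illustration of the per-layer spectral features listed in the abstract, the sketch below computes norms, stable rank, singular-value entropy, and effective rank from one layer's singular values, plus a sign-invariant cosine alignment between two vectors. The definitions are common textbook ones (stable rank as squared Frobenius norm over squared spectral norm, effective rank as the exponential of the entropy of the normalized singular values) and may differ from the paper's; the function names `spectral_summary` and `abs_cosine` are hypothetical. The singular values are taken as input and would typically come from an SVD of the LoRA delta, for example via `numpy.linalg.svd`.

```python
import math


def spectral_summary(singular_values):
    """Summarize one layer's weight delta from its singular values.

    Assumption: these are common definitions, not necessarily the paper's.
    """
    s = [x for x in singular_values if x > 0.0]  # drop exact zeros
    if not s:
        raise ValueError("need at least one positive singular value")
    squares = [x * x for x in s]
    total = sum(s)
    p = [x / total for x in s]  # normalized singular-value distribution
    entropy = -sum(pi * math.log(pi) for pi in p)
    return {
        "frobenius_norm": math.sqrt(sum(squares)),
        "spectral_norm": max(s),
        "stable_rank": sum(squares) / max(squares),
        "sv_entropy": entropy,
        "effective_rank": math.exp(entropy),
    }


def abs_cosine(u, v):
    """Absolute cosine similarity; singular vectors are defined only up to sign."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical singular values for a rank-4 adapter layer.
    print(spectral_summary([3.0, 1.0, 0.5, 0.1]))
    # Alignment of a layer's top singular vector with a (hypothetical) healthy centroid.
    print(abs_cosine([1.0, 0.0, 0.0], [0.8, 0.6, 0.0]))
```

In this sketch, a delta whose energy sits in one direction has stable rank and effective rank near 1, while a delta with equal singular values has both equal to its rank. Feature vectors like these, concatenated across layers, are the kind of input a logistic regression classifier could take.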
