TL;DR
This paper introduces geometric stability as a unified framework for predicting model steerability and detecting internal degradation, with supervised methods excelling in controllability prediction and unsupervised methods in drift detection.
Contribution
It demonstrates that geometric stability, especially when task-aligned, effectively predicts steerability and detects drift, providing complementary tools for language model deployment.
Findings
Supervised geometric stability predicts steerability with high accuracy ($\rho = 0.89$-$0.97$).
Unsupervised stability detects drift, registering nearly twice the geometric change that CKA does.
Supervised stability captures unique variance beyond class separability (partial $\rho = 0.62$-$0.76$).
Abstract
Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy ($\rho = 0.89$-$0.97$) across 35-69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial $\rho = 0.62$-$0.76$). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks, revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift…
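The abstract describes geometric stability only as the consistency of a representation's pairwise distance structure. The sketch below shows one way that idea could be turned into an unsupervised score. It assumes a split-half protocol: randomly split the feature dimensions in two, compute pairwise Euclidean distances within each half, and rank-correlate the two distance vectors. The split-half protocol, the Euclidean metric, the Spearman correlation, and every name here (`split_half_stability`, `n_splits`, `seed`) are illustrative assumptions, not the paper's Shesha definition.

```python
import math
import random


def pairwise_distances(X):
    """Upper-triangle Euclidean distances between the rows of X, as a flat list."""
    n = len(X)
    return [math.dist(X[i], X[j]) for i in range(n) for j in range(i + 1, n)]


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(a, b):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def split_half_stability(X, n_splits=20, seed=0):
    """Hypothetical stability score: mean Spearman correlation between the
    pairwise-distance vectors of two random, disjoint halves of the features.

    X is a list of equal-length rows (one embedding per row).
    """
    rng = random.Random(seed)
    d = len(X[0])
    scores = []
    for _ in range(n_splits):
        dims = list(range(d))
        rng.shuffle(dims)
        half_a, half_b = dims[: d // 2], dims[d // 2:]
        Xa = [[row[k] for k in half_a] for row in X]
        Xb = [[row[k] for k in half_b] for row in X]
        scores.append(spearman(pairwise_distances(Xa), pairwise_distances(Xb)))
    return sum(scores) / len(scores)
```

Under these assumptions, a representation whose dimensions all encode the same underlying structure scores close to 1, while one dominated by independent noise scores close to 0. A supervised, task-aligned variant would need class labels, which this sketch does not use.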
