Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Su-Hyeon Kim, Yo-Sub Han

TL;DR
This paper introduces an anchor-projection framework that maps hidden representations from different large language models into a shared space, enabling cross-family behavioral direction transfer without fine-tuning.
Contribution
The authors propose an anchor-projection method that aligns behavioral axes across models with different hidden dimensions, tokenizers, and training procedures, so that directions can be compared and transferred without fine-tuning or target-specific direction extraction.
Findings
Same-axis behavioral directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in the shared anchor coordinate space (ACS).
Transferred directions perform well on downstream tasks, e.g., 0.83 accuracy on ten-way detection.
Small anchor pools and a few source models suffice for effective transfer.
Abstract
Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out…
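The pipeline the abstract describes (project into ACS, average into a canonical direction, reconstruct in a new model from anchor activations alone) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the projection or reconstruction operators, so cosine similarity to anchors and a minimum-norm least-squares reconstruction are assumptions here, and all function names (`to_acs`, `canonical_direction`, `from_acs`, `_toy_model`, `demo`) are hypothetical. The toy models are synthetic isometric embeddings of a shared latent space, not real LLM activations.

```python
import numpy as np


def to_acs(anchors, direction):
    """Map a native direction (d,) to anchor coordinates (n_anchors,).

    Assumption: each coordinate is the cosine similarity between the
    direction and one anchor activation.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    v = direction / np.linalg.norm(direction)
    return a @ v


def canonical_direction(acs_directions):
    """Average per-model ACS directions and renormalize."""
    mean = np.mean(acs_directions, axis=0)
    return mean / np.linalg.norm(mean)


def from_acs(anchors, acs_direction):
    """Reconstruct a native direction from anchor activations only.

    Assumption: take the minimum-norm least-squares solution w of
    normalized_anchors @ w ~= acs_direction.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    w, *_ = np.linalg.lstsq(a, acs_direction, rcond=None)
    return w / np.linalg.norm(w)


def _toy_model(rng, latent_anchors, latent_dir, hidden_dim):
    """Synthetic 'model': embeds a shared latent space isometrically."""
    k = latent_anchors.shape[1]
    # q has orthonormal columns, shape (hidden_dim, k)
    q, _ = np.linalg.qr(rng.standard_normal((hidden_dim, k)))
    return latent_anchors @ q.T, latent_dir @ q.T


def demo(seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((64, 8))   # 64 anchors in an 8-dim latent space
    u = rng.standard_normal(8)         # shared latent behavioral direction
    # Three source models with different hidden sizes
    sources = [_toy_model(rng, z, u, d) for d in (32, 48, 40)]
    # Target model: its true direction is used only for scoring
    target_anchors, target_true = _toy_model(rng, z, u, 56)

    canon = canonical_direction([to_acs(a, v) for a, v in sources])
    rebuilt = from_acs(target_anchors, canon)
    # Cosine similarity between rebuilt and true target direction
    return float(rebuilt @ target_true / np.linalg.norm(target_true))


if __name__ == "__main__":
    print(demo())
```

In this idealized setting every model shares the same latent geometry, so the reconstruction should recover the target direction almost exactly; real model families need not satisfy that assumption, which is what the paper's cross-family evaluation tests.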
