Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
Tony Cristofano

TL;DR
This paper demonstrates that refusal behaviors in large language models are governed by a universal semantic circuit, allowing transfer of refusal interventions across diverse models without target-specific training.
Contribution
We introduce Trajectory Replay via Concept-Basis Reconstruction, a novel framework for cross-model transfer of refusal interventions based on shared semantic circuits.
Findings
Refusal interventions transfer effectively across diverse LLM architectures.
Transferred recipes reduce refusal behavior while preserving model capabilities.
Evidence supports the universality of safety-related semantic circuits in LLMs.
Abstract
Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs confirms that these transferred recipes…
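The abstract gives no formulas, so the following is only an illustrative sketch of how two of the described steps might look in linear-algebra terms; it is not the paper's implementation. It assumes (hypothetically) that a "recipe" is a vector of least-squares coefficients expressing the donor's refusal direction in terms of donor-side concept atoms, that the same coefficients are reused with the target's atoms for the same concepts, and that "high-variance weight subspaces" means the span of the top right singular vectors of a target weight matrix. The function names `fit_recipe`, `reconstruct_direction`, and `svd_guard`, and the parameter `top_k`, are invented for illustration. Layer alignment via concept fingerprints is not covered here.

```python
import numpy as np


def fit_recipe(donor_atoms: np.ndarray, donor_direction: np.ndarray) -> np.ndarray:
    """Express the donor refusal direction as a combination of concept atoms.

    donor_atoms: shape (k, d_donor), one concept atom per row (assumed layout).
    donor_direction: shape (d_donor,).
    Returns k least-squares coefficients (the hypothetical "recipe").
    """
    coeffs, *_ = np.linalg.lstsq(donor_atoms.T, donor_direction, rcond=None)
    return coeffs


def reconstruct_direction(target_atoms: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Rebuild a unit-norm direction in the target's space from the same recipe.

    target_atoms: shape (k, d_target), atoms for the same k concepts,
    measured in the target model (assumed to be available).
    """
    direction = target_atoms.T @ coeffs
    return direction / np.linalg.norm(direction)


def svd_guard(direction: np.ndarray, weight: np.ndarray, top_k: int) -> np.ndarray:
    """Remove the component of `direction` lying in the top-k right singular
    subspace of `weight` (shape (d_out, d_target)), then renormalize.

    Treating the top singular directions as the "high-variance" subspace
    is an assumption made for this sketch.
    """
    _, _, vt = np.linalg.svd(weight, full_matrices=False)
    top = vt[:top_k]                       # (top_k, d_target), orthonormal rows
    guarded = direction - top.T @ (top @ direction)
    norm = np.linalg.norm(guarded)
    # If nothing is left after projection, return the zero vector unchanged.
    return guarded / norm if norm > 0 else guarded
```

In this sketch the recipe is fitted once on the donor and reused unchanged on the target, which mirrors the abstract's claim that no target-side refusal supervision is needed; only concept atoms have to be measured in the target model.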
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
