TL;DR
The paper introduces Differentiable Faithfulness Alignment (DFA), a method that transfers circuit information from a smaller language model to a larger one through a learned differentiable alignment, so the larger model does not need full circuit discovery of its own.
Contribution
DFA is a novel framework that aligns source and target model circuits through a learned mapping, enabling scalable circuit transfer without full circuit discovery on the target.
Findings
DFA's strongest results are on Llama-3 1B to 3B transfer.
There, aligned circuits are often competitive with direct node attribution.
Transfer effectiveness decreases with larger model gaps and on Qwen-2.5.
Abstract
Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We introduce Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 1B to 3B, where aligned circuits are often competitive with direct node attribution and zero-shot…
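The abstract's core mechanism (project source-model node scores into the target model, then train the mapping against a soft faithfulness objective) can be illustrated with a toy sketch. This is not the paper's implementation. Everything here is an assumption made for illustration: the linear `alignment` matrix, the sigmoid soft mask, the stand-in "target model" whose output is a sum of per-node `contributions`, the squared-gap loss with a sparsity penalty, and all numeric values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def project(alignment, source_scores):
    # Hypothetical linear projection: target_score[j] = sum_i A[j][i] * s[i].
    return [sum(a * s for a, s in zip(row, source_scores)) for row in alignment]

def loss_and_mask(alignment, source_scores, contributions, sparsity):
    # Soft mask in (0, 1) per target node, so the loss stays differentiable.
    mask = [sigmoid(t) for t in project(alignment, source_scores)]
    full = sum(contributions)                      # toy "full model" output
    masked = sum(m * c for m, c in zip(mask, contributions))
    gap = masked - full                            # how unfaithful the soft circuit is
    loss = gap ** 2 + sparsity * sum(mask)         # assumed objective, not the paper's
    return loss, mask, gap

def train(source_scores, contributions, steps=200, lr=0.1, sparsity=0.05):
    # Start from an all-zero alignment, i.e. every soft mask entry is 0.5.
    alignment = [[0.0] * len(source_scores) for _ in contributions]
    for _ in range(steps):
        _, mask, gap = loss_and_mask(alignment, source_scores, contributions, sparsity)
        for j, (m, c) in enumerate(zip(mask, contributions)):
            # Chain rule: dL/dt_j = (2 * gap * c_j + sparsity) * m_j * (1 - m_j).
            grad_t = (2.0 * gap * c + sparsity) * m * (1.0 - m)
            for i, s in enumerate(source_scores):
                alignment[j][i] -= lr * grad_t * s  # dt_j/dA[j][i] = s_i
    return alignment

# Made-up example: 3 source nodes, 4 target nodes.
source_scores = [0.9, 0.1, 0.5]
contributions = [2.0, 0.1, -0.05, 1.5]

zero_alignment = [[0.0] * 3 for _ in range(4)]
initial_loss, _, _ = loss_and_mask(zero_alignment, source_scores, contributions, 0.05)
learned = train(source_scores, contributions)
final_loss, mask, _ = loss_and_mask(learned, source_scores, contributions, 0.05)
```

In this toy setup, target nodes with large contributions are pushed toward mask values near 1, while the sparsity term pulls unhelpful nodes toward 0. The actual method's objective, parameterization, and evaluation may differ substantially.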
