Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer

Shun Shao; Binxu Wang; Shay B. Cohen; Anna Korhonen; Yonatan Belinkov

arXiv:2604.24302 · cs.CL · April 28, 2026

TL;DR

The paper introduces Differentiable Faithfulness Alignment (DFA), a method that transfers circuit information from a smaller source language model to a larger target model through a learned differentiable alignment, so that the target model does not need full circuit discovery.

Contribution

DFA is a novel framework that aligns source and target model circuits through a learned mapping, enabling scalable circuit transfer without full circuit discovery on the target.

Findings

1. DFA performs well on Llama-3 1B to 3B transfer tasks.
2. Aligned circuits often match or surpass direct attribution methods.
3. Transfer effectiveness decreases with larger model gaps and on Qwen-2.5.

Abstract

Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We introduce Differentiable Faithfulness Alignment (DFA), a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 1B → 3B, where aligned circuits are often competitive with direct node attribution and zero-shot…
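
To make the core idea in the abstract concrete, here is a minimal toy sketch of learning a differentiable map from source-model node scores to a soft mask over target-model nodes. It is not the authors' implementation. Everything in it is an assumption for illustration: the names `soft_mask`, `faithfulness_loss`, and `train_alignment`, the linear-plus-sigmoid form of the mapping, the squared-gap-plus-sparsity loss, and the `contrib` list, which stands in for how much each target node contributes to the target model's output (a real setup would run the target model with soft ablations instead).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_mask(W, b, src_scores):
    """Project source node scores to one soft keep-probability per target node."""
    return [
        sigmoid(sum(w * s for w, s in zip(row, src_scores)) + bias)
        for row, bias in zip(W, b)
    ]

def faithfulness_loss(mask, contrib, lam):
    # Toy stand-in for a soft faithfulness objective (an assumption):
    # squared gap between the softly masked output and the full output,
    # plus a sparsity penalty that pushes unneeded nodes toward 0.
    gap = sum(c * (m - 1.0) for c, m in zip(contrib, mask))
    return gap * gap + lam * sum(mask)

def train_alignment(src_scores, contrib, steps=500, lr=0.5, lam=0.05, seed=0):
    """Fit the hypothetical linear alignment (W, b) by plain gradient descent."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in src_scores] for _ in contrib]
    b = [0.0 for _ in contrib]
    for _ in range(steps):
        mask = soft_mask(W, b, src_scores)
        gap = sum(c * (m - 1.0) for c, m in zip(contrib, mask))
        for j, (c, m) in enumerate(zip(contrib, mask)):
            # dL/dz_j, where z_j is the pre-sigmoid logit of target node j.
            grad_z = (2.0 * gap * c + lam) * m * (1.0 - m)
            for i, s in enumerate(src_scores):
                W[j][i] -= lr * grad_z * s
            b[j] -= lr * grad_z
    return W, b

# Two source nodes, four target nodes; target nodes 0 and 2 matter in this toy.
src_scores = [1.0, 0.5]
contrib = [2.0, 0.0, 1.0, 0.0]
W, b = train_alignment(src_scores, contrib)
print([round(m, 3) for m in soft_mask(W, b, src_scores)])
```

In this toy setting the learned mask tends toward 1 for target nodes with nonzero contribution and toward 0 for the rest, because the sparsity term is the only gradient signal acting on nodes that do not affect the output. The point is only that the mapping is trained end to end through the mask, with no discrete circuit search on the target side.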

Code & Models

Repositories

jasonshaoshun/dfa-circuits (GitHub)
