Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution
Qinhao Chen, Linyang He, Nima Mesgarani

TL;DR
The paper introduces PIE, a cross-layer transcoder (CLT)-native framework for efficient circuit discovery that prunes features before interpreting them, reducing interpretation cost while maintaining behavioral fidelity.
Contribution
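PIE establishes a prune-first, interpret-later paradigm, proposes a new attribution method (Feature Attribution Patching, FAP) with a synergy-aware reranking variant (FAP-Synergy), and provides a benchmarking environment for measuring behavioral fidelity and downstream interpretability under pruning.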
Findings
Under strict feature budgets, FAP-Synergy performs best, matching baseline fidelity with fewer features.
Benchmarking reveals operational regimes where different methods excel.
FAP-Synergy reduces interpretation costs by 33% while maintaining fidelity.
Abstract
Existing feature-interpretation pipelines typically operate on uniformly sampled units or exhaustive feature sets, incurring massive costs on units irrelevant to target behaviors. To address this, we introduce the first CLT-native end-to-end pruning framework, PIE, which pioneers the paradigm of pruning first and interpreting later. PIE connects Pruning, automatic Interpretation, and interpretation Evaluation, establishing a comprehensive benchmarking environment to systematically measure behavioral fidelity and downstream interpretability under pruning. Within this framework, we adapt strong relevance baselines and propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions. Furthermore, we introduce FAP-Synergy, a systematic synergy-aware reranking procedure. We evaluate pruning using…
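As a rough illustration of what "aggregating gradient-weighted write contributions" could mean, the sketch below scores a feature by a first-order (attribution-patching style) estimate: the change in the feature's activation between a clean and a patched run, multiplied by the sum over target layers of the dot product between the feature's write vector and the gradient of the metric at that layer. This is a minimal sketch under stated assumptions, not the paper's implementation. The names `fap_score` and `prune_top_k`, the per-layer dictionaries, the toy numbers, and the top-k budget rule are all hypothetical, and the synergy-aware reranking step is not modeled.

```python
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fap_score(act_clean: float, act_patch: float,
              write_vectors: Dict[int, Vector],
              grads: Dict[int, Vector]) -> float:
    """Hypothetical first-order score for one CLT feature.

    write_vectors maps each layer the feature writes to onto its write
    (decoder) vector; grads maps each layer onto the gradient of the
    behavioral metric with respect to that layer's residual stream.
    """
    delta = act_clean - act_patch
    # Aggregate the gradient-weighted write contribution over all target layers.
    return delta * sum(dot(w, grads[layer]) for layer, w in write_vectors.items())


def prune_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k features with the largest absolute score (assumed budget rule)."""
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]


# Toy data (invented for illustration): two layers, 2-dimensional residual stream.
grads = {1: [1.0, 0.0], 2: [0.0, 2.0]}
features = {
    "f_a": (1.0, 0.0, {1: [0.5, 0.0], 2: [0.0, 0.25]}),
    "f_b": (2.0, 1.5, {2: [1.0, -2.0]}),
    "f_c": (0.1, 0.1, {1: [1.0, 1.0]}),
}
scores = {name: fap_score(clean, patch, writes, grads)
          for name, (clean, patch, writes) in features.items()}
kept = prune_top_k(scores, k=2)
print(scores)
print(kept)
```

In this toy setup only the features in `kept` would be passed on to the interpretation stage, which is the sense in which pruning first reduces interpretation cost.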
