Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

Qinhao Chen; Linyang He; Nima Mesgarani

arXiv:2604.16889·cs.CL·May 11, 2026

TL;DR

The paper introduces PIE, a cross-layer transcoder (CLT)-native framework for circuit discovery that prunes features before interpreting them, reducing interpretation cost while maintaining behavioral fidelity.

Contribution

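It proposes a prune-first, interpret-later pipeline, a new attribution method (Feature Attribution Patching, FAP) with a synergy-aware reranking step (FAP-Synergy), and a benchmark that measures behavioral fidelity and downstream interpretability under pruning.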

Findings

01

At strict feature budgets, FAP-Synergy outperforms the baselines, matching their fidelity with fewer features.

02

Benchmarking reveals operational regimes where different methods excel.

03

FAP-Synergy reduces interpretation costs by 33% while maintaining fidelity.

Abstract

Existing feature-interpretation pipelines typically operate on uniformly sampled units or exhaustive feature sets, incurring massive costs on units irrelevant to target behaviors. To address this, we introduce the first CLT-native end-to-end pruning framework, PIE, which pioneers the paradigm of pruning first and interpreting later. PIE connects Pruning, automatic Interpretation, and interpretation Evaluation, establishing a comprehensive benchmarking environment to systematically measure behavioral fidelity and downstream interpretability under pruning. Within this framework, we adapt strong relevance baselines and propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions. Furthermore, we introduce FAP-Synergy, a systematic synergy-aware reranking procedure. We evaluate pruning using…
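
To make "aggregating gradient-weighted write contributions" concrete, here is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each CLT feature has a scalar activation and writes a decoder vector into one or more downstream layers, and that the gradient of the behavioral metric with respect to each layer's residual stream is available. The names `fap_scores` and `prune_top_k`, the data layout, and ranking by absolute score are all hypothetical. The synergy-aware reranking of FAP-Synergy is not sketched, because the truncated abstract does not describe it.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fap_scores(
    activations: List[float],
    decoders: List[Dict[int, Vector]],
    grads: Dict[int, Vector],
) -> List[float]:
    """Assumed scoring rule: activation * sum over written layers of
    (decoder write vector . metric gradient at that layer)."""
    scores = []
    for act, writes in zip(activations, decoders):
        total = sum(dot(w, grads[layer]) for layer, w in writes.items())
        scores.append(act * total)
    return scores


def prune_top_k(scores: List[float], k: int) -> List[int]:
    """Keep the k features with the largest |score| (an assumed criterion)."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(ranked[:k])


# Toy example: 3 features, 2 layers, 2-dimensional residual stream.
activations = [2.0, 0.5, 1.0]
decoders = [
    {0: [1.0, 0.0], 1: [0.0, 1.0]},  # feature 0 writes to layers 0 and 1
    {1: [4.0, 0.0]},                 # feature 1 writes to layer 1 only
    {0: [0.0, 0.1]},                 # feature 2 writes to layer 0 only
]
grads = {0: [0.5, -1.0], 1: [1.0, 0.25]}

scores = fap_scores(activations, decoders, grads)
kept = prune_top_k(scores, k=2)
```

In this toy setup only the `kept` features would be passed on to the interpretation stage, which is where a prune-first pipeline saves cost.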
