TL;DR
DePass is a novel, unified feature attribution method for Transformer models that uses a single decomposed forward pass to achieve accurate, fine-grained interpretability without additional training.
Contribution
It introduces a simple, faithful decomposition framework for feature attribution in Transformers, supporting attribution at the token, model-component, and subspace levels.
Findings
Effective at token-level attribution
Accurate at component-level attribution
Faithful in information-flow analysis between model components
Abstract
Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with the attention scores and MLP activations held fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
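To make the idea of a decomposed forward pass concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a toy single-head layer with no LayerNorm and no biases, uses a ReLU MLP whose on/off pattern stands in for "fixed activations", and splits the input into one component per source token. All names (`propagate`, `readout`, the weight matrices) are hypothetical. Once the attention scores and the activation pattern are frozen, the layer is linear in its input, so each component can be propagated separately and the parts sum back to the ordinary output.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, h = 4, 8, 16  # toy sizes: tokens, model width, MLP hidden width

X = rng.normal(size=(T, d))  # input hidden states, one row per token

# Assumed decomposition: component i keeps only token i's row, so the
# components sum to X. Other additive splits would work the same way.
components = np.zeros((T, T, d))
for i in range(T):
    components[i, i] = X[i]

# Random toy weights (hypothetical; no biases, no LayerNorm).
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
W_in = rng.normal(size=(d, h)) / np.sqrt(d)
W_out = rng.normal(size=(h, d)) / np.sqrt(h)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Ordinary forward pass; record the attention scores and the ReLU mask.
scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
causal = np.tril(np.ones((T, T), dtype=bool))
attn = softmax(np.where(causal, scores, -np.inf))
H = X + attn @ (X @ W_v) @ W_o              # attention + residual
mask = (H @ W_in > 0).astype(H.dtype)       # frozen ReLU on/off pattern
Y = H + (mask * (H @ W_in)) @ W_out         # MLP + residual

def propagate(c):
    """Push one additive component through the layer with attn and mask frozen."""
    h_c = c + attn @ (c @ W_v) @ W_o
    return h_c + (mask * (h_c @ W_in)) @ W_out

Y_parts = np.stack([propagate(c) for c in components])

# The decomposition is exact in this toy setting: the parts sum to the output.
assert np.allclose(Y_parts.sum(axis=0), Y)

# Token-level attribution of a scalar readout at the last position
# (the readout direction is a hypothetical stand-in for, e.g., a logit).
readout = rng.normal(size=d)
token_scores = Y_parts[:, -1] @ readout
assert np.isclose(token_scores.sum(), Y[-1] @ readout)
```

Real Transformers add LayerNorm, biases, multiple heads, and smooth activations such as GELU, each of which needs its own rule for dividing the output among components. The sketch leaves those rules out and should not be read as the paper's treatment of them.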