DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

Xiangyu Hong; Che Jiang; Kai Tian; Biqing Qi; Youbang Sun; Ning Ding; Bowen Zhou

arXiv:2510.18462·cs.CL·October 27, 2025

TL;DR

DePass is a novel, unified feature attribution method for Transformer models that uses a single decomposed forward pass to achieve accurate, fine-grained interpretability without additional training.

Contribution

DePass introduces a simple, faithful decomposition framework for feature attribution in Transformers, enabling interpretability at the token, model-component, and subspace levels.

Findings

01 Effective at token-level attribution

02 Accurate at component-level attribution

03 Demonstrates high fidelity in information flow analysis

Abstract

Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with attention scores and MLP activations held fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
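
The idea of propagating additive components through a frozen forward pass can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a single causal attention head followed by a bias-free ReLU MLP, with no layer normalization, so that freezing the attention scores and the ReLU activation pattern makes each layer exactly linear in the hidden state. All names (`C`, `mask`, `u`, the weight matrices) are hypothetical, and the paper's handling of the MLP and of normalization may differ from this simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, h = 4, 8, 16  # tokens, model width, MLP width (toy sizes)

X = rng.normal(size=(T, d))  # input hidden states, one row per token
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
W1 = rng.normal(size=(d, h)) / np.sqrt(d)
W2 = rng.normal(size=(h, d)) / np.sqrt(h)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# --- Ordinary forward pass (records what will be frozen) ---
scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
causal = np.tril(np.ones((T, T), dtype=bool))
A = softmax(np.where(causal, scores, -np.inf))  # frozen attention scores
H = X + A @ (X @ Wv) @ Wo                       # residual + attention
pre = H @ W1
mask = (pre > 0).astype(H.dtype)                # frozen ReLU pattern
Y = H + np.maximum(pre, 0.0) @ W2               # residual + MLP

# --- Decomposed forward pass ---
# Component i holds only token i's hidden state; the components sum to X.
C = np.zeros((T, T, d))
C[np.arange(T), np.arange(T)] = X

# With A fixed, attention is linear, so each component propagates on its own.
C_attn = C + A @ (C @ Wv) @ Wo

# With the ReLU pattern fixed, the MLP is linear in its input as well.
C_mlp = C_attn + ((C_attn @ W1) * mask) @ W2

# Token-level attribution of a scalar readout at the last position.
u = rng.normal(size=d)            # hypothetical readout direction
token_scores = C_mlp[:, -1] @ u   # one score per input token
```

In this toy setting the components sum back to the ordinary output `Y`, so the per-token scores in `token_scores` add up to the readout `Y[-1] @ u`. Choosing a different initial split of `X` (for example by subspace rather than by token) would give a different attribution target under the same propagation.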

Videos

DePass: Unified Feature Attributing by Simple Decomposed Forward Pass (SlideSlive)