Causal Interpretation of Neural Network Computations with Contribution Decomposition

Joshua Brendan Melander; Zaki Alaoui; Shenghua Liu; Surya Ganguli; Stephen A. Baccus

arXiv:2603.06557·cs.LG·March 9, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces CODEC, a method that decomposes neural network behavior into sparse, causal contribution modes, enabling better interpretation, control, and understanding of hierarchical nonlinear computations in both artificial and biological neural systems.

Contribution

We propose CODEC, a novel contribution decomposition technique using autoencoders to reveal causal, sparse motifs of neuron contributions in neural networks and biological models.

Findings

01

Contributions become sparser and higher-dimensional across layers.

02

Positive and negative effects on outputs tend to decorrelate in deeper layers.

03

Decomposition enables targeted manipulation and visualization of network behavior.

Abstract

Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing…
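The idea in the abstract, decomposing hidden-neuron contributions rather than activations, can be illustrated with a minimal sketch. It is not the authors' implementation. Everything here is an assumption made for illustration: the toy two-layer network, the definition of a contribution as activation times the gradient of the target output with respect to that activation, the plain L1-penalized autoencoder, and all names and hyperparameters (`train_sparse_autoencoder`, `n_modes`, `l1`, `lr`, `steps`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: x -> ReLU(x W1) -> h W2 (no biases, so the decomposition is exact).
n_samples, n_in, n_hidden, n_out = 256, 20, 32, 5
X = rng.normal(size=(n_samples, n_in))
W1 = rng.normal(size=(n_in, n_hidden)) / np.sqrt(n_in)
W2 = rng.normal(size=(n_hidden, n_out)) / np.sqrt(n_hidden)

H = np.maximum(X @ W1, 0.0)        # hidden activations, (n_samples, n_hidden)
logits = H @ W2                    # network outputs, (n_samples, n_out)

# ASSUMED contribution definition: activation * d(output)/d(activation).
# With a linear readout, that gradient is simply the readout weight.
target = logits.argmax(axis=1)     # explain each sample's predicted class
C = H * W2[:, target].T            # contributions, (n_samples, n_hidden)

def train_sparse_autoencoder(C, n_modes=16, l1=1e-3, lr=0.05, steps=500, seed=0):
    """Fit codes = ReLU(C We + b), recon = codes Wd, with an L1 penalty on codes."""
    r = np.random.default_rng(seed)
    n, d = C.shape
    We = r.normal(size=(d, n_modes)) * 0.1
    Wd = r.normal(size=(n_modes, d)) * 0.1
    b = np.zeros(n_modes)
    losses = []
    for _ in range(steps):
        pre = C @ We + b
        Z = np.maximum(pre, 0.0)                 # sparse codes
        err = Z @ Wd - C                         # reconstruction error
        loss = (err ** 2).sum(axis=1).mean() + l1 * np.abs(Z).sum(axis=1).mean()
        losses.append(loss)
        # Manual gradients of the loss above (full-batch gradient descent).
        g_recon = 2.0 * err / n
        g_Wd = Z.T @ g_recon
        g_Z = g_recon @ Wd.T + l1 * np.sign(Z) / n
        g_pre = g_Z * (pre > 0)
        We -= lr * (C.T @ g_pre)
        b -= lr * g_pre.sum(axis=0)
        Wd -= lr * g_Wd
    return We, b, Wd, losses

We, b, Wd, losses = train_sparse_autoencoder(C)
modes = Wd                         # each row is one mode over hidden units
```

In this toy setting the per-neuron contributions sum exactly to the target logit, which is what distinguishes them from raw activations: two neurons with the same activation can push the output in opposite directions depending on the sign of their readout weight. Each row of `modes` is a candidate pattern of hidden-neuron contributions; whether such rows match the modes reported in the paper depends on details not given on this page.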

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* Originality: Clear formulation that swaps activations for contributions before SAE, producing sparse modes that are easy to manipulate
* Quality: Targeted masking yields selective effects for classes inside ResNet-50 and the retina case study shows external face validity
* Clarity: The method is explained step by step, and the visualization procedure is easy to follow
* Significance: The pipeline could be a practical analysis and editing tool for CNNs and may assist mechanistic probing in cons

Weaknesses

* Modern literature already demonstrates SAE-driven feature discovery and steering in ViTs and vision-language models. The submission does not compare against these systems or articulate why contribution-SAE is preferable.
* All main results are in CNNs. There is no validation on ViTs, attention heads, MLP neurons, or token features, which is where much of the community focuses today
* The experiments show interventional control under channel masking in one backbone but do not establish model-in

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* The paper presents a novel and conceptually appealing perspective by focusing on how hidden neurons contribute to network outputs rather than merely analyzing their activation patterns. This shift is well-motivated from neuroscience principles, where understanding neural function requires examining both receptive fields (input sensitivity) and projective fields (output effects). The theoretical framework is rigorous, with complete input space decompositions properly derived for

Weaknesses

* While the paper acknowledges sensitivity to hyperparameters including hidden layer size and regularization, this issue is not systematically studied or addressed. The authors mention that these parameters "may require tuning for different architectures" but provide no guidance on how to perform this tuning, no ablation studies exploring the sensitivity, and no principled approach for selecting appropriate values. This is particularly problematic for the number of modes (k), whi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- To my knowledge, this is the first paper to apply SAEs to contribution matrices instead of model activations. The authors argue, and methodologically demonstrate, that this can help reveal causal contributions of hidden units that are not revealed through similar analyses on model activations.
- The proposed method is flexible: architecture agnostic, can be adapted to different contribution targets, does not require access to the model’s original training data (although, see weakness 1). Thi

Weaknesses

- The authors suggest that CODEC generalizes to other architectures and tasks but do not provide evidence of this. Although it logically should, it would be beneficial if the authors could substantiate this claim with small experiments on other tasks and/or modalities.
- Human interpretability of modes is not guaranteed, and their interpretations are often subjective. Although in CODEC analyses the authors identify modes that are correlated with class outputs, it is not always the case t

Taxonomy

Topics: Retinal Development and Disorders · Visual perception and processing mechanisms · Face Recognition and Perception