Causal Interpretation of Neural Network Computations with Contribution Decomposition
Joshua Brendan Melander, Zaki Alaoui, Shenghua Liu, Surya Ganguli, Stephen A. Baccus

TL;DR
This paper introduces CODEC, a method that decomposes neural network behavior into sparse, causal contribution modes, enabling better interpretation, control, and understanding of hierarchical nonlinear computations in both artificial and biological neural systems.
Contribution
We propose CODEC, a novel contribution decomposition technique using autoencoders to reveal causal, sparse motifs of neuron contributions in neural networks and biological models.
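The sparse-autoencoder step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the contribution matrix `C` is random stand-in data, and the layer sizes, the ReLU encoder, the L1 penalty, and the names `n_modes`, `sae_forward`, and `sae_loss` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in contribution matrix: 32 inputs x 16 hidden units.
# Random values for illustration only; real contributions come from a trained network.
C = rng.normal(size=(32, 16))

n_modes = 6  # hypothetical number of modes (autoencoder hidden size)
W_enc = rng.normal(scale=0.1, size=(16, n_modes))
b_enc = np.zeros(n_modes)
W_dec = rng.normal(scale=0.1, size=(n_modes, 16))

def sae_forward(C):
    """Encode contributions into non-negative mode activations, then reconstruct."""
    codes = np.maximum(C @ W_enc + b_enc, 0.0)  # ReLU keeps mode activations >= 0
    recon = codes @ W_dec                       # each decoder row is one candidate mode
    return codes, recon

def sae_loss(C, l1_weight=1e-2):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    codes, recon = sae_forward(C)
    mse = np.mean((C - recon) ** 2)
    sparsity = np.mean(np.abs(codes))
    return mse + l1_weight * sparsity

codes, recon = sae_forward(C)
loss = sae_loss(C)
```

Minimizing `sae_loss` over `W_enc`, `b_enc`, and `W_dec` with any gradient-based optimizer would yield decoder rows that play the role of contribution modes in this toy setting.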
Findings
Contributions become sparser and higher-dimensional across layers.
Positive and negative effects on outputs tend to decorrelate in deeper layers.
Decomposition enables targeted manipulation and visualization of network behavior.
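As a rough illustration of targeted manipulation, the sketch below masks hidden units in a toy two-layer network and checks the effect on one output. It is a simplification under stated assumptions: it masks individual units rather than learned modes, it defines a unit's contribution as activation times readout weight, and the network, weights, and names (`forward`, `contrib`, `mask`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy network: x -> ReLU(W1 x) = h -> W2 h = logits
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

def forward(x, mask=None):
    """Run the network, optionally multiplying hidden units by a 0/1 mask."""
    h = np.maximum(W1 @ x, 0.0)
    if mask is not None:
        h = h * mask
    return W2 @ h, h

x = rng.normal(size=4)
target = 0
logits, h = forward(x)

# Assumed contribution of each hidden unit to the target logit.
contrib = h * W2[target]

# Zero out every unit that pushes the target logit up.
mask = (contrib <= 0).astype(float)
masked_logits, _ = forward(x, mask)

# With a linear readout, the logit drops by exactly the removed contributions.
removed = contrib[contrib > 0].sum()
```

Because the readout here is linear, the change in the target logit equals the sum of the masked contributions; in deeper networks the relationship is only approximate.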
Abstract
Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing…
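The distinction between activations and contributions can be made concrete with a small sketch. The abstract excerpt does not give the contribution formula, so this example assumes a common choice, activation times the gradient of the target output with respect to that activation; the toy network and the names `hidden` and `contributions` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: x -> ReLU(W1 x) = h -> W2 h = logits
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

def hidden(x):
    """Hidden-layer activations (non-negative because of the ReLU)."""
    return np.maximum(W1 @ x, 0.0)

def contributions(x, target):
    """Assumed contribution: activation times d(logit_target)/d(activation)."""
    h = hidden(x)
    grad = W2[target]  # the readout is linear in h, so the gradient is the weight row
    return h * grad

x = rng.normal(size=4)
target = 1
h = hidden(x)
c = contributions(x, target)
logit = (W2 @ h)[target]

# Activations are all >= 0, but contributions carry a sign:
# a strongly active unit can push the target output down.
most_active = int(np.argmax(h))
largest_contributor = int(np.argmax(c))
```

In this linear-readout case the contributions sum exactly to the target logit, which is what makes them a statement about the output rather than only about the hidden layer.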
Peer Reviews
Decision: ICLR 2026 Poster
* Originality: Clear formulation that swaps activations for contributions before SAE, producing sparse modes that are easy to manipulate
* Quality: Targeted masking yields selective effects for classes inside ResNet-50 and the retina case study shows external face validity
* Clarity: The method is explained step by step, and the visualization procedure is easy to follow
* Significance: The pipeline could be a practical analysis and editing tool for CNNs and may assist mechanistic probing in cons
* Modern literature already demonstrates SAE-driven feature discovery and steering in ViTs and vision-language models. The submission does not compare against these systems or articulate why contribution-SAE is preferable.
* All main results are in CNNs. There is no validation on ViTs, attention heads, MLP neurons, or token features, which is where much of the community focuses today
* The experiments show interventional control under channel masking in one backbone but do not establish model-in
## Strengths
* The paper presents a novel and conceptually appealing perspective by focusing on how hidden neurons contribute to network outputs rather than merely analyzing their activation patterns. This shift is well-motivated from neuroscience principles, where understanding neural function requires examining both receptive fields (input sensitivity) and projective fields (output effects). The theoretical framework is rigorous, with complete input space decompositions properly derived for
## Weaknesses
* While the paper acknowledges sensitivity to hyperparameters including hidden layer size and regularization, this issue is not systematically studied or addressed. The authors mention that these parameters "may require tuning for different architectures" but provide no guidance on how to perform this tuning, no ablation studies exploring the sensitivity, and no principled approach for selecting appropriate values. This is particularly problematic for the number of modes (k), whi
- To my knowledge, this is the first paper to apply SAEs to contribution matrices instead of model activations. The authors argue, and methodologically demonstrate, that this can help reveal causal contributions of hidden units that are not revealed through similar analyses on model activations.
- The proposed method is flexible: architecture agnostic, can be adapted to different contribution targets, does not require access to the model’s original training data (although, see weakness 1). Thi
- Authors suggest the generality of CODEC to other architectures and tasks but do not provide evidence of this working. Although logically, it should, it would be beneficial if the authors could substantiate this claim with small experiments on other tasks and/or modalities.
- Human interpretability of modes is not guaranteed, and their interpretations are often subjective. Although in CODEC analyses the authors identify modes that are correlated with class outputs, it is not always the case t
Taxonomy
Topics: Retinal Development and Disorders · Visual perception and processing mechanisms · Face Recognition and Perception
