Much Easier Said Than Done: Falsifying the Causal Relevance of Linear Decoding Methods
Lucas Hayne, Abhijit Suresh, Hunar Jain, Rahul Kumar, R. McKell Carter

TL;DR
This paper tests whether linear decoding methods reliably reveal how a neural network functions. It finds only a weak link between the units that probes identify and the units that are causally important, and suggests that interpretability work focus on causally important units.
Contribution
It systematically compares probe-identified units with causally important units via ablation, challenging the effectiveness of linear decoding for neural interpretability.
Findings
Weak correlation between probe-identified and ablation-identified units
Interaction between selectivity and activity predicts ablation effects
Linear decoders partly overlap with causally important units
Abstract
Linear classifier probes are frequently utilized to better understand how neural networks function. Researchers have approached the problem of determining unit importance in neural networks by probing their learned, internal representations. Linear classifier probes identify highly selective units as the most important for network function. Whether or not a network actually relies on high selectivity units can be tested by removing them from the network using ablation. Surprisingly, when highly selective units are ablated they only produce small performance deficits, and even then only in some cases. In spite of the absence of ablation effects for selective neurons, linear decoding methods can be effectively used to interpret network function, leaving their effectiveness a mystery. To falsify the exclusive role of selectivity in network function and resolve this contradiction, we…
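The comparison the abstract describes, probe-assigned importance versus the effect of ablating a unit, can be illustrated with a toy sketch. This is not the authors' code or experimental setup. The synthetic activations, the ridge-regression stand-in for a linear classifier probe, the zero-ablation scheme, and all names (`fit_linear_probe`, `ablation_effects`, `readout`) are assumptions made for illustration only.

```python
import numpy as np

def fit_linear_probe(acts, labels, ridge=1e-3):
    """Ridge-regression stand-in for a linear probe: one weight per unit."""
    X = acts - acts.mean(axis=0)
    y = labels - labels.mean()
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

def ablation_effects(acts, readout, labels):
    """Accuracy drop of the toy network when each unit is zeroed in turn."""
    def accuracy(a):
        return np.mean((a @ readout > 0) == labels.astype(bool))
    base = accuracy(acts)
    drops = np.empty(acts.shape[1])
    for j in range(acts.shape[1]):
        ablated = acts.copy()
        ablated[:, j] = 0.0  # zero-ablation of unit j (an assumed scheme)
        drops[j] = base - accuracy(ablated)
    return drops

rng = np.random.default_rng(0)
n, d = 2000, 16
labels = rng.integers(0, 2, size=n)

# Hypothetical hidden layer: every unit carries class signal plus noise,
# so every unit is linearly decodable to some degree.
signal = rng.normal(0.0, 0.5, size=d)
acts = np.outer(2 * labels - 1, signal) + rng.normal(0.0, 1.0, size=(n, d))

# Hypothetical downstream readout that uses only the first half of the units.
used = np.arange(d) < d // 2
readout = signal * used

probe_w = fit_linear_probe(acts, labels)
drops = ablation_effects(acts, readout, labels)

for j in range(d):
    print(f"unit {j:2d}  |probe weight| = {abs(probe_w[j]):.4f}  "
          f"accuracy drop = {drops[j]:+.4f}")
```

In this toy setup the second half of the units receive nonzero probe weights yet produce no accuracy change when ablated, because the readout ignores them. It shows only that decodability and causal use can come apart in principle; it says nothing about how large the gap is in the networks the paper studies.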
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ferroelectric and Negative Capacitance Devices
Methods: Depthwise Convolution · Batch Normalization · Pointwise Convolution · Depthwise Separable Convolution · 1x1 Convolution · Inverted Residual Block · Convolution · Average Pooling
