Interpreting the Second-Order Effects of Neurons in CLIP
Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt

TL;DR
This paper introduces the "second-order lens", which interprets CLIP neurons by analyzing the effect that flows from a neuron through the later attention heads directly to the output. The analysis reveals that neurons are highly selective and polysemantic, and the authors apply these findings to adversarial example generation and zero-shot segmentation.
Contribution
The paper presents a novel second-order analysis method for interpreting CLIP neurons, uncovering their selectivity and polysemy, and demonstrates practical applications in adversarial attacks and segmentation.
Findings
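Second-order effects are highly selective: for each neuron, the effect is significant for less than 2% of images.
Each neuron's second-order effect can be approximated by a single direction in CLIP's joint text-image space.
Neurons exhibit polysemantic behavior: each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars).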
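The first two findings can be illustrated on synthetic data. The sketch below is not the authors' code: the `effects` matrix, the `rel_threshold` significance rule, and the function name are all assumptions made for illustration. It measures what fraction of images receive a non-negligible effect from one neuron, and how well a single direction (the top singular vector) accounts for those effects.

```python
import numpy as np

def selectivity_and_direction(effects, rel_threshold=0.1):
    """effects: (n_images, d) array; row i is a hypothetical second-order
    effect of one neuron on the output representation for image i.
    rel_threshold is an assumed significance rule, not taken from the paper."""
    norms = np.linalg.norm(effects, axis=1)
    # An image counts as "significant" if its effect norm exceeds a
    # fraction of the largest effect norm observed for this neuron.
    significant = norms > rel_threshold * norms.max()
    frac_significant = float(significant.mean())
    # Rank-1 approximation: the top right singular vector is the single
    # direction that best explains all per-image effects.
    _, s, vt = np.linalg.svd(effects, full_matrices=False)
    direction = vt[0]
    energy = float(s[0] ** 2 / np.sum(s ** 2))
    return frac_significant, direction, energy

# Synthetic neuron: active on 15 of 1000 images, always along one direction.
rng = np.random.default_rng(0)
n_images, d = 1000, 32
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
coeffs = np.zeros(n_images)
active = rng.choice(n_images, size=15, replace=False)
coeffs[active] = rng.uniform(5.0, 10.0, size=15)
effects = np.outer(coeffs, true_dir) + 0.01 * rng.normal(size=(n_images, d))

frac, direction, energy = selectivity_and_direction(effects)
print(f"significant on {frac:.1%} of images; rank-1 energy {energy:.3f}")
```

On real data, `effects` would have to be computed from the model itself; the toy construction above only shows what "selective" and "approximately rank one" mean operationally.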
Abstract
We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e. the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails to capture the neurons' function in CLIP. Therefore, we present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. We find that these effects are highly selective: for each neuron, the effect is significant for <2% of the images. Moreover, each effect can be approximated by a single direction in the text-image space of CLIP. We describe neurons by decomposing these directions into sparse sets of text representations. The sets reveal polysemantic behavior - each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars). Exploiting this…
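To make the idea of decomposing a direction into a sparse set of text representations concrete, here is a minimal sketch using greedy orthogonal matching pursuit. The abstract does not specify the sparse-coding algorithm, so this is a stand-in rather than the authors' method; the text labels, the orthonormal toy embeddings, and the function name are hypothetical.

```python
import numpy as np

def sparse_text_decomposition(direction, text_embeds, k=2):
    """Greedy orthogonal matching pursuit (an assumed stand-in algorithm).
    direction: (d,) vector to explain.
    text_embeds: (n_texts, d) array of unit-norm text representations.
    Returns the chosen text indices and their least-squares coefficients."""
    residual = direction.copy()
    chosen = []
    coeffs = np.zeros(0)
    for _ in range(k):
        scores = np.abs(text_embeds @ residual)
        if chosen:
            scores[chosen] = -np.inf  # never pick the same text twice
        chosen.append(int(np.argmax(scores)))
        basis = text_embeds[chosen].T  # shape (d, number chosen so far)
        # Refit all chosen coefficients jointly, then update the residual.
        coeffs, *_ = np.linalg.lstsq(basis, direction, rcond=None)
        residual = direction - basis @ coeffs
    return chosen, coeffs

# Toy text pool with hypothetical labels and orthonormal embeddings.
labels = ["ship", "car", "dog", "tree", "piano"]
rng = np.random.default_rng(0)
d = 16
q, _ = np.linalg.qr(rng.normal(size=(d, len(labels))))
text_embeds = q.T  # rows are orthonormal

# A polysemantic "neuron direction" mixing two unrelated concepts.
neuron_dir = 0.8 * text_embeds[0] + 0.6 * text_embeds[1]

chosen, coeffs = sparse_text_decomposition(neuron_dir, text_embeds, k=2)
for idx, c in zip(chosen, coeffs):
    print(f"{labels[idx]}: {c:.2f}")
```

Real text representations are correlated rather than orthonormal, so a greedy selection like this one can pick near-duplicate descriptions; the toy pool is made orthonormal only to keep the example easy to check.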
Peer Reviews
Decision: ICLR 2025 Poster
This paper draws inspiration from recent approaches that aim to examine and evaluate the functionality of each neuron in a given architecture. Automated interpretability constitutes an important challenge for modern architectures and this work aims to approach this in a different way via the contribution of neurons to the output representation and the information flow through the MSA blocks.
The connection of the proposed approach to highly relevant work is a bit lacking. Can the authors provide a discussion on [1], highlighting the differences in the decomposition and analysis of the direct effects of the neurons? I find the focus on a single dataset, i.e., ImageNet, to be a bit restrictive in terms of analysing the behavior of the proposed approach. Indeed, most approaches in this line of work considered additional datasets, e.g., Waterbirds, CUB and Places365. The same applies f
1. The technical contributions are sound and interesting.
2. The paper is well written.
3. The paper included thorough evaluations.
Generally good paper so please see questions.
- Extensive empirical validation of second order effects (e.g. second order effect neuron sparseness)
- Intuitive and interesting applications of second order effect control in the semantic adversarial example generation
- Increased understanding of internal attention model mechanism through semantic adversarial examples
- Improved segmentation results over TextSpan
- Sparse coding to find textual descriptions of neurons may be very computationally expensive
- Not considering nonlinearities in second order effects (Eqn 5)
Taxonomy
Topics: Neuroscience and Neuropharmacology Research
Methods: Contrastive Language-Image Pre-training
