From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
Jingtong Su, Julia Kempe, Karen Ullrich

TL;DR
This paper introduces SAMD and SAMI, novel methods for identifying and manipulating attention modules in transformers to interpret complex concepts, enhance performance, and control model behavior across language and vision tasks.
Contribution
The paper presents a unified, concept-agnostic approach to map and intervene on attention modules in transformers, addressing complex concepts beyond simple factual associations.
Findings
SAMD accurately identifies concept-related attention heads across models.
SAMI effectively modulates model behavior by amplifying or diminishing concepts.
The methods are domain-agnostic, applicable to both language and vision transformers.
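A minimal sketch of how the two findings above could fit together, in plain Python. It is not the authors' implementation. The review excerpts further down mention cosine similarity, a number of heads K, and a scaling factor s, so the sketch assumes that discovery ranks heads by cosine similarity to a concept vector and keeps the top K, and that intervention multiplies the selected heads' outputs by s. The function names, the `(layer, head)` keys, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def discover_module(head_outputs, concept_vector, k):
    """Assumed SAMD-style step: rank heads by cosine similarity to the concept, keep the top k.

    head_outputs maps a hypothetical (layer, head) key to that head's output vector.
    """
    scores = {key: cosine(vec, concept_vector) for key, vec in head_outputs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


def intervene(head_outputs, module, s):
    """Assumed SAMI-style step: scale the outputs of the selected heads by s.

    s > 1 amplifies the concept, 0 <= s < 1 diminishes it; other heads are untouched.
    """
    selected = set(module)
    return {
        key: [s * x for x in vec] if key in selected else list(vec)
        for key, vec in head_outputs.items()
    }


# Toy data: four heads with 2-dimensional outputs and a concept along the first axis.
head_outputs = {
    (0, 0): [1.0, 0.0],
    (0, 1): [0.0, 1.0],
    (1, 0): [0.9, 0.1],
    (1, 1): [-1.0, 0.0],
}
concept_vector = [1.0, 0.0]

module, scores = discover_module(head_outputs, concept_vector, k=2)
suppressed = intervene(head_outputs, module, s=0.0)
amplified = intervene(head_outputs, module, s=2.0)
```

In a real transformer the head outputs would come from forward hooks on the attention layers, and the intervention would be applied during the forward pass rather than to stored vectors. Both are omitted here to keep the sketch self-contained.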
Abstract
Transformers have achieved state-of-the-art performance across language and vision tasks. This success drives the imperative to interpret their internal mechanisms with the dual goals of enhancing performance and improving behavioral control. Attribution methods help advance interpretability by assigning model outputs associated with a target concept to specific model components. Current attribution research primarily studies multi-layer perceptron neurons and addresses relatively simple concepts such as factual associations (e.g., Paris is located in France). This focus tends to overlook the impact of the attention mechanism and lacks a unified approach for analyzing more complex concepts. To fill these gaps, we introduce Scalable Attention Module Discovery (SAMD), a concept-agnostic method for mapping arbitrary, complex concepts to specific attention heads of general transformer…
Peer Reviews
Decision: ICLR 2026 Poster
* Novel and elegant method for circuit discovery
* Clear presentation of the findings
* Demonstration of effectiveness of method on a broad variety of applications over two different modalities. Especially in the vision literature this is addressing a research gap, as vision-circuit discovery remains under-explored.
* 4.2: the construction of $D_p$ is unclear from just reading the main body of the paper.
* Concept figure should be improved
  * Font way too small
  * No order of panels provided
  * SAMI is not explained in the rightmost panel
* Comparison to a baseline such as difference in means is missing for 4.1, 4.2 and 4.4. If this concern is addressed appropriately I will improve my score.
* In 4.2 the authors only report evals on the dataset that they used for construction. An OOD reasoning benchmark e
Very interesting method with good novelty. Unlike many previous MLP neuron attribution approaches, this is the first I have seen which identifies attention concepts. This is an interesting and critical result with the stronger push for mechanistic interpretability and adjacent approaches in the modern XAI literature. Well written with extensive experimental results. I think the simplicity of the approach is a benefit to its usability. I feel that I could replicate this with a few hours of w
Minor – plainly calling this an attribution method feels misaligned with the literature. Attribution methods often refer to input (feature) attribution. This is more aligned with neuron attribution. Perhaps it should be attention attribution but not to be confused with input attribution using attention weights/gradients. There are not any true comparisons against other methods. It is hard to tell if this should be negative because it may be challenging to create a fair comparison against a sim
- Introducing attention-head–level concept attribution is an original direction that is computationally light and easily applicable to diverse transformer architectures.
- The same pipeline is used across text and vision models, suggesting potential generality and extensibility.
- The paper contributes to ongoing efforts to connect internal transformer components to semantic behaviors, particularly through sparse, interpretable “modules.”
- The evaluation is dominated by qualitative visualizations and anecdotal examples. There are no robust statistical analyses, reproducible metrics, or causal validation to confirm that the discovered modules truly mediate the claimed concepts.
- The use of cosine similarity as a proxy for conceptual alignment is not theoretically or empirically justified; results may reflect correlation, not causation.
- Choices of K (number of heads) and s (scaling factor) appear arbitrary, tuned via small
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
