Extracting Rule-based Descriptions of Attention Features in Transformers
Dan Friedman, Adithya Bhaskar, Alexander Wettig, Danqi Chen

TL;DR
This paper introduces a method to extract rule-based descriptions of attention features in transformers, enabling more interpretable explanations of model behavior through token pattern rules.
Contribution
It proposes a novel approach for automatically extracting rule-based descriptions of attention features, including skip-gram, absence, and counting rules, from transformer models like GPT-2.
Findings
A majority of features are described by around 100 skip-gram rules
Absence rules are prevalent even in early layers
Counting rules are identified in some features
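The three rule types named above can be sketched as simple token-pattern matchers. This is only an illustrative sketch: the class names, fields, and exact matching semantics are assumptions made for this example, not the paper's definitions, and the example tokens are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SkipGramRule:
    """Hypothetical rule: '[context] ... [query] -> output'."""
    context: str  # token that must appear somewhere earlier in the sequence
    query: str    # token at the current (last) position
    output: str   # token whose likelihood the rule is assumed to promote

    def fires(self, tokens: Sequence[str]) -> bool:
        # The query must be the last token; the context may be anywhere before it.
        return bool(tokens) and tokens[-1] == self.query and self.context in tokens[:-1]


@dataclass(frozen=True)
class AbsenceRule:
    """Hypothetical rule: a skip-gram rule that is suppressed by a blocker token."""
    base: SkipGramRule
    blocker: str  # if this token appears earlier, the rule does not fire

    def fires(self, tokens: Sequence[str]) -> bool:
        return self.base.fires(tokens) and self.blocker not in tokens[:-1]


@dataclass(frozen=True)
class CountingRule:
    """Hypothetical rule: fires once a token has occurred at least `threshold` times."""
    query: str
    counted: str
    threshold: int
    output: str

    def fires(self, tokens: Sequence[str]) -> bool:
        if not tokens or tokens[-1] != self.query:
            return False
        # Count occurrences of the tracked token before the current position.
        return list(tokens[:-1]).count(self.counted) >= self.threshold


skip = SkipGramRule(context="Toronto", query="speaks", output="English")
absence = AbsenceRule(base=skip, blocker="Montreal")
counting = CountingRule(query=")", counted="(", threshold=2, output=")")

print(skip.fires(["In", "Toronto", "she", "speaks"]))
print(absence.fires(["From", "Montreal", "to", "Toronto", "she", "speaks"]))
print(counting.fires(["(", "(", "x", ")"]))
```

Representing each rule as a small immutable record keeps the description symbolic: a feature would then be summarized by a list of such rules rather than by a set of activating exemplars.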
Abstract
Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how it may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1)…
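The "sparse linear combination of basis vectors" mentioned in the abstract can be written out as follows. This is the generic sparse-autoencoder formulation, not notation taken from the paper; the symbols $h$, $d_i$, $a_i$, $b$, and $m$ are assumptions introduced here, and the non-negativity constraint assumes a ReLU-style encoder.

```latex
\[
  h \;\approx\; b + \sum_{i=1}^{m} a_i(h)\, d_i,
  \qquad a_i(h) \ge 0, \quad a_i(h) = 0 \text{ for most } i
\]
% h      : hidden state (here, an attention-layer output)
% d_i    : the i-th basis vector ("feature" direction)
% a_i(h) : activation of feature i on this hidden state
% b      : a bias term
```

Under this reading, a rule-based description aims to state, in terms of input tokens, when a given $a_i(h)$ is nonzero.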
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper examines behaviors at the feature level, such as skip-gram patterns, that are frequently studied in circuit analysis. This offers an interesting view through the SAE lens and makes the analysis more scalable.
- It identifies the discovered rules on a broad set of sequences, indicating the robustness of these features.
- It focuses on a clearly defined set of rules that can be sensibly studied.
- The evaluation, while run on a broad set of samples, is qualitatively limited to binary activations.
- The paper does not compare to traditional circuit analysis approaches, which could analyse the same patterns; such a comparison would show how much the two overlap.
- Similarly, there are no ablations on methods other than SAEs.
- While the paper evaluates how often the patterns are present in the collected data, it would be interesting to see how they appear in real-world tasks.
- The approach uses a single, sma
- The paper falls within the mechanistic interpretability tradition and builds on the work of Kissane et al. (2024) with sparse autoencoders. The suggested method for finding rules goes beyond manual exemplar inspection and is thus able to find rules that are hard to identify manually.
- While the details of the methodology are somewhat obscure in places, the general approach seems reasonable and sound.
- The results may be of some interest given the amount of attention to mechanistic interpretabili
- The main contribution is the procedure for finding rules: while it is an improvement over manual exemplar examination, it is still based on quite strong priors, hard-coded search patterns, and assumptions.
- The advertised symbolic and interpretable nature of the found rules only really applies to the bottom transformer layer, where inputs can be directly linked to input tokens. In layers above, the interpretation of the features and rules becomes increasingly murky.
- The presentation relies
- The paper proposes a new method to interpret SAE features. It differs from previous work especially in its treatment of the QK features, analyzing the feature interactions that contribute to the formation of attention patterns.
- The method is capable of finding interesting sets of features that may be hard to find with autointerp.
1. Some of the claims rely on weak evidence:
   - The evidence for counting rules is a single qualitative example (Figure 7).
   - In the absence rules analysis, the counterfactual validation (Fig. 6c) is limited to the first attention layer.
2. The entire evaluation is limited to one model, GPT-2 small. It is not clear whether this methodology will apply to larger models.
3. The method for ranking rules relies on heuristics, like picking the "top 100" features to reduce the search space. The
The paper is generally well-written and does a good job of contextualizing itself with respect to prior work. Section 3.1 does a good job of presenting the main idea of the paper, which appears to be an original and significant extension of mechanistic interpretability to the attention mechanism. The authors test multiple ranking strategies for pruning.
1. The method is limited in its expressivity; the types of rule extracted are quite simple, and it does not appear to be straightforward to extend it to take into account other parts of the transformer architecture besides the attention mechanism, such as the feedforward layers. However, this may not be a very serious issue, as mechanistic interpretability methods typically rely on these simplifications, and 3.1 does a good job of justifying the types of rules they study.
2. I'm not sure how to
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Natural Language Processing Techniques
