Extracting Rule-based Descriptions of Attention Features in Transformers

Dan Friedman; Adithya Bhaskar; Alexander Wettig; Danqi Chen

arXiv:2510.18148 · cs.CL · October 22, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a method to extract rule-based descriptions of attention features in transformers, enabling more interpretable explanations of model behavior through token pattern rules.

Contribution

It proposes a novel approach for automatically extracting rule-based descriptions of attention features, including skip-gram, absence, and counting rules, from transformer models like GPT-2.
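To make the three rule families concrete, here is a minimal toy sketch of what skip-gram, absence, and counting rules could look like as token-pattern matchers. The class names, fields, and matching semantics are assumptions chosen for illustration; they are not taken from the paper, which extracts such rules from SAE features rather than writing them by hand.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SkipGramRule:
    """Hypothetical: fires when `earlier` occurs anywhere before a final `current` token."""
    earlier: str
    current: str

    def matches(self, tokens: Sequence[str]) -> bool:
        if not tokens or tokens[-1] != self.current:
            return False
        return self.earlier in tokens[:-1]


@dataclass(frozen=True)
class AbsenceRule:
    """Hypothetical: fires when the base skip-gram matches and `blocker` is absent from the context."""
    base: SkipGramRule
    blocker: str

    def matches(self, tokens: Sequence[str]) -> bool:
        return self.base.matches(tokens) and self.blocker not in tokens[:-1]


@dataclass(frozen=True)
class CountingRule:
    """Hypothetical: fires when `token` appears at least `threshold` times."""
    token: str
    threshold: int

    def matches(self, tokens: Sequence[str]) -> bool:
        return sum(t == self.token for t in tokens) >= self.threshold


# Toy context; the tokens are arbitrary and not drawn from the paper.
context = ["the", "cat", "did", "not", "see", "the"]
skip = SkipGramRule(earlier="cat", current="the")
absence = AbsenceRule(base=skip, blocker="not")
count = CountingRule(token="the", threshold=2)
print(skip.matches(context), absence.matches(context), count.matches(context))
```

In this simplified picture, each matching rule would then raise or lower the likelihood of particular output tokens; that step is left out of the sketch.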

Findings

01. Majority of features described by around 100 skip-gram rules

02. Absence rules are prevalent even in early layers

03. Counting rules are identified in some features

Abstract

Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how they may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1)…
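The "sparse linear combination of basis vectors" idea can be illustrated with a small sketch. The dictionary, the activation values, and the `reconstruct` helper below are all hypothetical toy choices; a real sparse autoencoder learns its dictionary and typically has thousands of features.

```python
def reconstruct(activations: dict[int, float], dictionary: list[list[float]]) -> list[float]:
    """Sum activation * basis vector over the few active features."""
    dim = len(dictionary[0])
    hidden = [0.0] * dim
    for idx, act in activations.items():
        for d in range(dim):
            hidden[d] += act * dictionary[idx][d]
    return hidden


# Toy dictionary of four feature directions in a 3-dimensional hidden space.
dictionary = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

# Sparse code: only features 0 and 3 are active; all others are implicitly zero.
active = {0: 2.0, 3: 0.5}
hidden = reconstruct(active, dictionary)
print(hidden)
```

The sparse code says which features are active and how strongly, but not what each feature means; that gap is what the rule-based descriptions aim to fill.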

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper examines behaviors at the feature level, such as skip-gram patterns, that are frequently studied in circuit analysis. This provides an interesting view through the SAE lens and makes the analysis more scalable.
- It identifies the discovered rules on a broad set of sequences, indicating the robustness of these features.
- It focuses on a clearly defined set of rules that can be sensibly studied.

Weaknesses

- The evaluation, while covering a broad set of samples, is qualitatively limited to binary activations.
- The paper does not compare against traditional circuit analysis approaches, which could analyze the same patterns; such a comparison would show how much the two overlap.
- Similarly, there are no ablations on methods other than SAEs.
- While the paper evaluates how often the patterns are present in the collected data, it would be interesting to see how often they appear in real-world tasks.
- The approach uses a single, sma

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The paper falls within the mechanistic interpretability tradition and builds on the work of Kissane et al. 2024 with sparse autoencoders. The suggested method for finding rules goes beyond manual exemplar inspection and is thus able to find rules that are hard to identify manually.
- While the details of the methodology are somewhat obscure in places, the general approach seems reasonable and sound.
- The results may be of some interest given the amount of attention to mechanistic interpretabili

Weaknesses

- The main contribution is the procedure for finding rules: while it's an improvement over manual exemplar examination, it is still based on quite strong priors and on hard-coded search patterns and assumptions.
- The advertised symbolic and interpretable nature of the found rules only really applies to the bottom transformer layer, where inputs can be directly linked to input tokens. In the layers above, the interpretation of the features and rules becomes increasingly murky.
- The presentation relies

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper proposes a new method to interpret SAE features. It differs from previous work especially in its treatment of the QK features, analyzing the feature interactions that contribute to the formation of attention patterns.
- The method is capable of finding interesting sets of features that may be hard to find with autointerp.

Weaknesses

1. Some of the claims rely on weak evidence:
   - The evidence for counting rules is a single qualitative example (Figure 7).
   - In the absence rules analysis, the counterfactual validation (Fig. 6c) is limited to only the first attention layer.
2. The entire evaluation is limited to one model, GPT-2 small. It is not clear whether this methodology will apply to larger models.
3. The method for ranking rules relies on heuristics, like picking the "top 100" features to reduce the search space. The

Reviewer 04 · Rating 4 · Confidence 2

Strengths

The paper is generally well-written and does a good job of contextualizing itself with respect to prior work. Section 3.1 does a good job of presenting the main idea of the paper, which appears to be an original and significant extension of mechanistic interpretability to the attention mechanism. The authors test multiple ranking strategies for pruning.

Weaknesses

1. The method is limited in its expressivity; the types of rule extracted are quite simple, and it does not appear to be straightforward to extend it to take into account other parts of the transformer architecture besides the attention mechanism, such as the feedforward layers. However, this may not be a very serious issue, as mechanistic interpretability methods typically rely on these simplifications, and 3.1 does a good job of justifying the types of rules they study.
2. I'm not sure how to

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Natural Language Processing Techniques