Mechanistic evaluation of Transformers and state space models

Aryaman Arora; Neil Rathi; Nikil Roashan Selvam; Róbert Csordás; Dan Jurafsky; Christopher Potts

arXiv:2505.15105 · cs.CL · February 2, 2026

Open Access · 3 Reviews

TL;DR

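This paper compares Transformers and state space models (SSMs) on the synthetic Associative Recall (AR) task. Only Transformers and Based fully succeed, and they do so by using induction to store key-value associations in context; Mamba and DeltaNet come close behind, while H3 and Hyena fail. Behavioural accuracy alone does not explain these differences, which is why the paper argues for mechanistic evaluation.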

Contribution

The paper provides a mechanistic analysis of different architectures' ability to perform associative recall, identifying how induction mechanisms contribute to success.

Findings

01

Transformers and Based SSMs fully succeed at associative recall.

02

Mamba implements induction via short convolutions, not SSMs.

03

Architectures with similar accuracy can have different underlying mechanisms.
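To make finding 02 concrete, here is a minimal sketch of how a causal convolution with kernel size 2 can place key information at the value position, which is the "previous token" step that induction needs. This is an illustration under stated assumptions, not the paper's code or Mamba's implementation: the function `causal_conv1d`, the one-hot token vectors, and the shared (rather than per-channel) kernel weights are all hypothetical simplifications.

```python
def causal_conv1d(seq, kernel):
    """Causal 1D convolution over a sequence of vectors.

    seq:    list of T vectors (lists of floats), one per position.
    kernel: list of K weights; kernel[-1] multiplies the current token,
            kernel[-2] the previous token, and so on.
    Assumption: the same weights are shared across all channels.
    """
    K = len(kernel)
    dim = len(seq[0])
    out = []
    for t in range(len(seq)):
        acc = [0.0] * dim
        for j, w in enumerate(kernel):
            src = t - (K - 1 - j)  # position this weight reads from
            if src >= 0:           # causal: never read before the start
                for d in range(dim):
                    acc[d] += w * seq[src][d]
        out.append(acc)
    return out


# Hypothetical one-hot tokens: keys A, B and values X, Y.
A, B, X, Y = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
seq = [A, X, B, Y]  # sequence "A X B Y"

mixed = causal_conv1d(seq, [1.0, 1.0])  # kernel size 2
unmixed = causal_conv1d(seq, [1.0])     # kernel size 1

# With kernel size 2, the value position also carries its key.
assert mixed[1] == [1, 0, 1, 0]    # position of X holds A + X
# With kernel size 1, the value position knows nothing about the key.
assert unmixed[1] == [0, 0, 1, 0]  # position of X holds only X
```

With a kernel of size 1 no position sees its predecessor, so in this toy setting the key-value pairing would have to be computed by some other component.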

Abstract

State space models (SSMs) for language modelling promise an efficient and performant alternative to quadratic-attention Transformers, yet show variable performance on recalling basic information from the context. While performance on synthetic tasks like Associative Recall (AR) can point to this deficiency, behavioural metrics provide little information as to why, on a mechanistic level, certain architectures fail and others succeed. To address this, we conduct experiments on AR, and find that only Transformers and Based SSM models fully succeed at AR, with Mamba and DeltaNet close behind, while the other SSMs (H3, Hyena) fail. We then use causal interventions to explain why. We find that Transformers and Based learn to store key-value associations in-context using induction. By contrast, the SSMs seem to compute these associations only at the last state using a single…
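For readers unfamiliar with the task, the following is a rough sketch of what an Associative Recall example can look like: a context of key-value pairs followed by a query key, with the paired value as the target. The exact sequence format, vocabulary, and sizes used in the paper are not given here, so `make_ar_example`, `lookup_baseline`, and the token names are illustrative assumptions only.

```python
import random


def make_ar_example(num_pairs, key_vocab, value_vocab, rng):
    """Build one AR sequence 'k1 v1 k2 v2 ... query' and its target value.

    Assumption: keys are distinct within a sequence, so the answer is unique.
    """
    keys = rng.sample(key_vocab, num_pairs)
    values = [rng.choice(value_vocab) for _ in keys]
    context = []
    for k, v in zip(keys, values):
        context.extend([k, v])
    query_idx = rng.randrange(num_pairs)
    return context + [keys[query_idx]], values[query_idx]


def lookup_baseline(tokens):
    """Oracle solver: find the earlier copy of the query key and
    return the token that follows it."""
    query = tokens[-1]
    for i in range(0, len(tokens) - 1, 2):  # keys sit at even positions
        if tokens[i] == query:
            return tokens[i + 1]
    return None


rng = random.Random(0)
key_vocab = [f"k{i}" for i in range(16)]
value_vocab = [f"v{i}" for i in range(16)]

tokens, target = make_ar_example(4, key_vocab, value_vocab, rng)
assert len(tokens) == 2 * 4 + 1
assert lookup_baseline(tokens) == target
```

The oracle makes the required computation explicit: match the query against an earlier key, then read out the token that followed it. Behavioural accuracy only checks the final answer; it cannot say whether a trained model performs this lookup by induction or by some other route.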

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The intervention protocol pinpoints the mechanism. More specifically, the QKV positioned restorations at layer inputs/outputs disambiguate induction from direct retrieval.
- The kernel size ablation study demonstrates that convolution kernels with short receptive fields implement the association needed for AR.
- With the new task (ATR), the paper verifies that the mechanism transfers to a harder task without positional dependence.

Weaknesses

- The experimental results seem quite noisy. For Figure 4 in particular, it is unclear why the configuration with model dim 128 and Mamba Conv. = 2 suddenly fails. It is also unclear why the Restored @ Key value for this configuration is notably high compared to other entries. A similar trend is observed in Figure 5, where model configurations that perform well can fail drastically (~0% accuracy) when the learning rate is slightly modified. This result, unless it can be justified empirically or theoretically, raises

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Comprehensively and mechanistically evaluates common linear models and Transformers.
2. Clear and good writing.

Weaknesses

1. The mechanistic metrics are not used to guide architecture design; they only aid understanding, so their usefulness is limited.

Reviewer 03 · Rating 2 · Confidence 5

Strengths

Studying basic performance tasks on transformers vs new sequence models is interesting. The authors train a large set of models and carry out the analysis also on ATR, which is a less common setup that I did not know before, but I find quite insightful. The paper is also pleasant to read and schematic, which helps deliver the message. Plots are clear.

Weaknesses

The findings in the paper have quite a few overlaps with previous works:

- Zoology: https://arxiv.org/abs/2312.04927
- Based: https://arxiv.org/abs/2402.18668
- Convolution-augmented transformers: https://arxiv.org/abs/2407.05591
- Revisiting Associative Recall: https://openreview.net/pdf/f7e9f322ba15e88dcc818ab70866648650a5e319.pdf
- H3: https://arxiv.org/pdf/2212.14052

In light of the findings in the papers above, I did not find the paper very surprising. The authors cite all the papers abo

Taxonomy

Topics: Probabilistic and Robust Engineering Design · Simulation Techniques and Applications · Fault Detection and Control Systems

Methods: Convolution · Mamba: Linear-Time Sequence Modeling with Selective State Spaces