Does Transformer Interpretability Transfer to RNNs?
Gonçalo Paulo, Thomas Marshall, Nora Belrose

TL;DR
This paper investigates whether interpretability methods developed for transformer models are effective on new recurrent architectures like Mamba and RWKV, finding that most techniques transfer well and can be improved by leveraging RNNs' compressed states.
Contribution
The study demonstrates that transformer interpretability techniques are applicable to modern RNNs and shows how to enhance them by exploiting RNNs' compressed state representations.
Findings
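Most of the selected interpretability methods (contrastive activation addition, the tuned lens, and latent knowledge elicitation) remain effective when applied to RNNs.
Some of these techniques can be improved by taking advantage of RNNs' compressed state.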
RNNs can be steered and analyzed similarly to transformers using these methods.
Abstract
Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine if selected interpretability methods originally designed for transformer language models will transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and we show that it is possible to improve some of them by taking advantage of RNNs'…
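As a rough illustration of the first technique named in the abstract, contrastive activation addition is commonly described as taking the difference between mean activations on two contrasting prompt sets and adding that difference, scaled by a multiplier, to the model's activations at inference time. The sketch below shows only that arithmetic on toy vectors. The function names, the multiplier value, and the numbers are hypothetical and are not taken from the paper or its code.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference between mean activations on the two contrasting prompt sets."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(activation, vector, multiplier=1.0):
    """Add the scaled steering vector to a single activation vector."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Toy 3-dimensional activations (made-up numbers, for illustration only).
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]

vec = steering_vector(positive, negative)
steered = apply_steering([0.5, 0.5, 0.5], vec, multiplier=2.0)
print(vec, steered)
```

In a real model the vectors would be activations read at a chosen layer, for example through a framework hook, and the steered activation would be written back before the forward pass continues. Which layer and multiplier work best is an empirical question that this sketch does not address.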
