Tracr: Compiled Transformers as a Laboratory for Interpretability

David Lindner; János Kramár; Sebastian Farquhar; Matthew Rahtz; Thomas McGrath; Vladimir Mikulik

arXiv:2301.05062 · cs.LG · November 6, 2023 · 6 cites

TL;DR

Tracr compiles human-readable programs into transformer models whose known structure supports interpretability experiments and ground-truth evaluation of interpretability methods.

Contribution

The paper presents Tracr, a compiler that generates transformer models with known, interpretable structures for research and evaluation purposes.

Findings

01

Tracr-compiled models accurately implement the specified programs.

02

The known structure enables targeted interpretability experiments.

03

Models successfully perform tasks like token frequency counting, sorting, and parenthesis checking.

Abstract

We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown, it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.
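
To give a feel for what "human-readable programs" means here: Tracr's input language is RASP, which is built around select, aggregate, and selector-width operations over sequence positions. The sketch below re-implements simplified versions of those operations in plain Python and uses them to express two of the programs the abstract lists. It does not call the Tracr library. All function names (`select`, `selector_width`, `aggregate`, `token_frequencies`, `sort_unique`) are hypothetical stand-ins, numeric mean aggregation is a simplification, and the sort assumes distinct numeric tokens.

```python
from typing import Callable, List, Sequence

# selector[q][k] is True when query position q attends to key position k.
Selector = List[List[bool]]


def select(keys: Sequence, queries: Sequence,
           predicate: Callable[[object, object], bool]) -> Selector:
    """Build a boolean attention pattern by comparing every key with every query."""
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: Selector) -> List[int]:
    """Count, for each query position, how many key positions it attends to."""
    return [sum(row) for row in selector]


def aggregate(selector: Selector, values: Sequence[float],
              default: float = 0.0) -> List[float]:
    """Average the selected values per query position (default if none selected)."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else default)
    return out


def token_frequencies(tokens: Sequence) -> List[int]:
    """For each position, count how often its token occurs in the sequence."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


def sort_unique(values: Sequence[float]) -> List[float]:
    """Sort distinct numbers by moving each value to its rank position."""
    # Rank of each value = number of strictly smaller values in the sequence.
    smaller = select(values, values, lambda k, q: k < q)
    target_pos = selector_width(smaller)
    # Output position i attends to the single input whose rank equals i.
    indices = list(range(len(values)))
    move = select(target_pos, indices, lambda k, q: k == q)
    return aggregate(move, values)


print(token_frequencies(list("abac")))  # [2, 1, 2, 1]
print(sort_unique([3, 1, 2]))           # [1.0, 2.0, 3.0]
```

Each `select` corresponds loosely to an attention pattern and each `aggregate` to the attention output, which is why programs written in this style can be mapped onto transformer layers with a known structure.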

Code & Models

Repositories

google-deepmind/tracr
JAX · Official

Videos

Tracr: Compiled Transformers as a Laboratory for Interpretability · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification