TracrBench: Generating Interpretability Testbeds with Large Language Models
Hannes Thurnherr, Jérémy Scheurer

TL;DR
TracrBench introduces a new dataset of transformer models and RASP programs, generated with large language models, to facilitate the evaluation of interpretability methods in transformer-based language models.
Contribution
The paper presents TracrBench, a novel dataset of 121 RASP programs and transformer weights, created with LLMs and human validation, to serve as a ground truth testbed for interpretability research.
Findings
GPT-4-turbo correctly implements 57 out of 101 programs.
Generating RASP programs with LLMs is challenging and often requires manual effort.
TracrBench provides a valuable resource for evaluating interpretability methods.
Abstract
Achieving a mechanistic understanding of transformer-based language models is an open challenge, especially due to their large number of parameters. Moreover, the lack of ground truth mappings between model weights and their functional roles hinders the effective evaluation of interpretability methods, impeding overall progress. Tracr, a method for generating compiled transformers with inherent ground truth mappings in RASP, has been proposed to address this issue. However, manually creating a large number of models needed for verifying interpretability methods is labour-intensive and time-consuming. In this work, we present a novel approach for generating interpretability test beds using large language models (LLMs) and introduce TracrBench, a novel dataset consisting of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights.…
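To give a feel for what a RASP program looks like, here is a minimal pure-Python sketch of RASP-style select and aggregate semantics. It is not the Tracr library or its API: the function names `select`, `selector_width`, `aggregate`, `length`, and `frac_prevs_x` are illustrative assumptions, and the code only mimics how such programs compute over sequence positions. It does not compile anything to transformer weights.

```python
# Illustrative sketch only: hypothetical helpers that mimic RASP-style
# semantics in plain Python. None of these names come from Tracr itself.

def select(keys, queries, predicate):
    """Build a boolean matrix with sel[q][k] = predicate(keys[k], queries[q])."""
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector):
    """For each query position, count how many key positions are selected."""
    return [sum(row) for row in selector]


def aggregate(selector, values):
    """For each query position, average the values at the selected positions."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out


def length(tokens):
    """Every position attends to every position, so the width is the length."""
    all_true = select(tokens, tokens, lambda k, q: True)
    return selector_width(all_true)


def frac_prevs_x(tokens):
    """Fraction of tokens equal to "x" up to and including each position."""
    indices = list(range(len(tokens)))
    prefix = select(indices, indices, lambda k, q: k <= q)
    is_x = [1.0 if t == "x" else 0.0 for t in tokens]
    return aggregate(prefix, is_x)
```

Calling `length(list("abc"))` returns the sequence length at every position, and `frac_prevs_x` returns a running fraction per position. Because each step is an explicit selection or aggregation, the intended role of every component is known in advance, which is the property that makes compiled programs of this kind usable as ground truth.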
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
