TracrBench: Generating Interpretability Testbeds with Large Language Models

Hannes Thurnherr; Jérémy Scheurer

arXiv:2409.13714·cs.CL·September 24, 2024

TL;DR

TracrBench introduces a new dataset of transformer models and RASP programs, generated with large language models, to facilitate the evaluation of interpretability methods in transformer-based language models.

Contribution

The paper presents TracrBench, a novel dataset of 121 RASP programs and transformer weights, created with LLMs and human validation, to serve as a ground truth testbed for interpretability research.

Findings

1. GPT-4-turbo correctly implements 57 out of 101 programs.

2. Generating RASP programs with LLMs is challenging and often requires manual effort.

3. TracrBench provides a valuable resource for evaluating interpretability methods.

Abstract

Achieving a mechanistic understanding of transformer-based language models is an open challenge, especially due to their large number of parameters. Moreover, the lack of ground truth mappings between model weights and their functional roles hinders the effective evaluation of interpretability methods, impeding overall progress. Tracr, a method for generating compiled transformers with inherent ground truth mappings in RASP, has been proposed to address this issue. However, manually creating a large number of models needed for verifying interpretability methods is labour-intensive and time-consuming. In this work, we present a novel approach for generating interpretability test beds using large language models (LLMs) and introduce TracrBench, a novel dataset consisting of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights.…
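To make the idea of a RASP program concrete, here is a minimal sketch in plain Python of a select-then-aggregate computation, the kind of operation that RASP expresses and that Tracr compiles into attention layers. It does not use the Tracr library or its API: the helpers `select`, `aggregate`, and `frac_prevs` are hypothetical names introduced only for illustration, loosely modeled on the corresponding RASP primitives. The example computes, at each position, the fraction of tokens so far that equal a target token.

```python
from typing import Callable, List, Sequence


def select(
    keys: Sequence[int],
    queries: Sequence[int],
    predicate: Callable[[int, int], bool],
) -> List[List[bool]]:
    """Build a boolean attention pattern: row = query position, column = key position."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector: List[List[bool]], values: Sequence[float]) -> List[float]:
    """Average the selected values for each query position (0.0 if nothing is selected)."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out


def frac_prevs(tokens: Sequence[str], target: str) -> List[float]:
    """Fraction of tokens up to and including each position that equal `target`."""
    indices = list(range(len(tokens)))
    # Numeric indicator: 1.0 where the token matches the target, else 0.0.
    is_target = [1.0 if t == target else 0.0 for t in tokens]
    # Each query position attends to itself and all earlier positions.
    prevs = select(indices, indices, lambda k, q: k <= q)
    return aggregate(prevs, is_target)


print(frac_prevs(list("xaxx"), "x"))
```

Because every step is an explicit select or aggregate, the intended role of each component is known in advance, which is the property that makes compiled models useful as ground truth for interpretability methods.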

Code & Models

Repositories

HannesThurnherr/TracrBench (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling