Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

Dana Arad; Yonatan Belinkov; Hanjie Chen; Najoung Kim; Hosein Mohebbi; Aaron Mueller; Gabriele Sarti; Martin Tutek

arXiv:2511.18409·cs.CL·November 25, 2025

TL;DR

This paper reports on the BlackboxNLP 2025 Shared Task, which advances the evaluation of mechanistic interpretability in language models through community-wide benchmarking of circuit and causal variable localization methods.

Contribution

The paper extends the Mechanistic Interpretability Benchmark (MIB) into a shared task, providing a standardized, reproducible framework for comparing diverse mechanistic interpretability (MI) techniques across multiple teams.

Findings

1. Ensemble and regularization strategies improved circuit localization.
2. Low-dimensional, non-linear projections enhanced causal variable localization.
3. Multiple methods showed significant progress in localizing influential model components.

Abstract

Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications