Automated Interpretability and Feature Discovery in Language Models with Agents

Arnau Marin-Llobet; Javier Ferrando

arXiv:2605.01555·cs.CL·May 5, 2026

TL;DR

This paper presents an autonomous multiagent system that automates interpretability and feature discovery in large language models through iterative explanation refinement and feature generation.

Contribution

It introduces a novel multiagent framework that automates mechanistic interpretability, improving explanation quality and discovering language-specific features.

Findings

01 Agent-driven loops produce sharper, more falsifiable explanations.

02 The system discovers language-specific and safety-relevant features.

03 The agent outperforms one-shot auto-interpretations.

Abstract

We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: (1) explanation refinement, where an agent proposes competing hypotheses and iteratively tests them with targeted prompt controls and a multi-metric evaluation; and (2) feature discovery, where an agent generates prompt sets, constructs a k-nearest-neighbor graph in activation space, and retrieves candidate features using statistical separability and semantic coherence criteria. On Gemma-2 family models and MLP neurons in weight-sparse transformers, our agent improves over one-shot auto-interpretations, discovers language-specific and safety-relevant features, and produces auditable explanation traces, showing that agent-driven empirical loops yield sharper and more falsifiable…
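The feature-discovery loop described in the abstract (a k-nearest-neighbor graph in activation space, then retrieval by statistical separability) can be illustrated with a toy sketch. This is not the authors' implementation. The cosine similarity measure, the standardized mean difference used as a separability score, the neighbor-purity check, the synthetic activations, and all function names are assumptions made for illustration. The semantic coherence criterion is not modeled here.

```python
import math
import random
from statistics import mean, pvariance


def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_graph(acts, k):
    """For each prompt, list the indices of its k most similar prompts."""
    graph = []
    for i, u in enumerate(acts):
        sims = [(cosine(u, v), j) for j, v in enumerate(acts) if j != i]
        sims.sort(reverse=True)  # most similar first
        graph.append([j for _, j in sims[:k]])
    return graph


def separability(acts, labels):
    """Standardized mean difference per unit: target (1) vs control (0)."""
    scores = []
    for d in range(len(acts[0])):
        pos = [a[d] for a, y in zip(acts, labels) if y == 1]
        neg = [a[d] for a, y in zip(acts, labels) if y == 0]
        pooled = math.sqrt((pvariance(pos) + pvariance(neg)) / 2) + 1e-12
        scores.append((mean(pos) - mean(neg)) / pooled)
    return scores


def neighbor_purity(graph, labels):
    """Fraction of graph edges that connect prompts with the same label."""
    agree = [labels[j] == labels[i] for i, nbrs in enumerate(graph) for j in nbrs]
    return sum(agree) / len(agree)


# Hypothetical data: 20 target prompts and 20 control prompts, 8 units each.
# Unit 3 is artificially boosted on target prompts to stand in for a feature.
rng = random.Random(0)
labels = [1] * 20 + [0] * 20
acts = [[rng.gauss(0, 1) for _ in range(8)] for _ in labels]
for a, y in zip(acts, labels):
    if y == 1:
        a[3] += 5.0

graph = knn_graph(acts, k=5)
scores = separability(acts, labels)
best_unit = max(range(len(scores)), key=lambda d: scores[d])
purity = neighbor_purity(graph, labels)
print(f"candidate unit: {best_unit}, neighbor purity: {purity:.2f}")
```

In this toy setup the planted unit should receive the highest separability score. In a real pipeline the activations would come from model forward passes over agent-generated prompt sets, and a candidate would also need to pass a coherence check before being reported.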
