Automated Interpretability and Feature Discovery in Language Models with Agents
Arnau Marin-Llobet, Javier Ferrando

TL;DR
This paper presents an autonomous multiagent system that automates interpretability in large language models by coupling two loops: iterative explanation refinement and feature discovery.
Contribution
It introduces a novel multiagent framework that automates mechanistic interpretability, improving explanation quality and discovering language-specific features.
Findings
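Agent-driven empirical loops produce sharper, more falsifiable explanations.
The system discovers language-specific and safety-relevant features.
The agent improves over one-shot auto-interpretations on Gemma-2 family models and on MLP neurons in weight-sparse transformers.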
Abstract
We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: (1) explanation refinement, where an agent proposes competing hypotheses and iteratively tests them with targeted prompt controls and a multi-metric evaluation; and (2) feature discovery, where an agent generates prompt sets, constructs a k-nearest-neighbor graph in activation space, and retrieves candidate features using statistical separability and semantic coherence criteria. On Gemma-2 family models and MLP neurons in weight-sparse transformers, our agent improves over one-shot auto-interpretations, discovers language-specific and safety-relevant features, and produces auditable explanation traces, showing that agent-driven empirical loops yield sharper and more falsifiable…
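To make the feature-discovery loop more concrete, here is a minimal sketch of its two numeric steps: building a k-nearest-neighbor graph over prompt activations and ranking features by how well they separate target prompts from control prompts. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the distance metric or the separability statistic, so cosine similarity and Cohen's d are stand-ins, the activations are synthetic, and all function names are hypothetical. The semantic coherence criterion is not modeled here.

```python
import numpy as np

def knn_graph(activations: np.ndarray, k: int) -> np.ndarray:
    """Return, for each prompt, the indices of its k nearest neighbors.

    Assumption: cosine similarity in activation space (the source does not
    name a metric).
    """
    norms = np.linalg.norm(activations, axis=1, keepdims=True)
    unit = activations / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a prompt is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def separability(activations: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-feature Cohen's d between target (1) and control (0) prompts.

    Assumption: Cohen's d stands in for the unspecified
    'statistical separability' criterion.
    """
    pos = activations[labels == 1]
    neg = activations[labels == 0]
    pooled = np.sqrt((pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)) / 2)
    return (pos.mean(axis=0) - neg.mean(axis=0)) / np.clip(pooled, 1e-12, None)

def neighbor_purity(neighbors: np.ndarray, labels: np.ndarray) -> float:
    """Average fraction of a prompt's neighbors that share its label."""
    return float((labels[neighbors] == labels[:, None]).mean())

# Synthetic stand-in for model activations: 20 control and 20 target prompts,
# 8 features, with feature 3 planted as the one that responds to the target set.
rng = np.random.default_rng(0)
n_per_group, n_features = 20, 8
control = rng.normal(0.0, 1.0, (n_per_group, n_features))
target = rng.normal(0.0, 1.0, (n_per_group, n_features))
target[:, 3] += 5.0
activations = np.vstack([control, target])
labels = np.array([0] * n_per_group + [1] * n_per_group)

neighbors = knn_graph(activations, k=5)
scores = separability(activations, labels)
top_feature = int(np.argmax(np.abs(scores)))
purity = neighbor_purity(neighbors, labels)

print("candidate feature:", top_feature)
print("neighbor purity:", round(purity, 3))
```

In a real run the `activations` matrix would come from the model under study (for example, MLP neuron activations on agent-generated prompt sets), and the candidate features returned by the ranking would then be passed to the explanation-refinement loop for testing.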
