Because we have LLMs, we Can and Should Pursue Agentic Interpretability

Been Kim; John Hewitt; Neel Nanda; Noah Fiedel; Oyvind Tafjord

arXiv:2506.12152 · cs.AI · June 17, 2025


TL;DR

This paper explores agentic interpretability enabled by LLMs, where models proactively assist human understanding through multi-turn conversations, fostering better mental models of AI systems.

Contribution

It introduces the concept of agentic interpretability, highlighting its potential, challenges, and the need for new evaluation methods in human-AI collaborative understanding.

Findings

01

Agentic interpretability uses an LLM's ability to teach and explain in conversation, a capability that traditional inspective methods do not use.

02

It enables humans to develop better mental models of AI systems.

03

Evaluation is challenging because of the complexity of having a human in the loop.

Abstract

The era of Large Language Models (LLMs) presents a new opportunity for interpretability: agentic interpretability, a multi-turn conversation with an LLM wherein the LLM proactively assists human understanding by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM. Such conversation is a new capability that traditional 'inspective' interpretability methods (opening the black-box) do not use. Having a language model that aims to teach and explain, beyond just knowing how to talk, is similar to a teacher whose goal is to teach well, understanding that their success will be measured by the student's comprehension. While agentic interpretability may trade off completeness for interactivity, making it less suitable for high-stakes safety situations with potentially deceptive models, it leverages a cooperative model to…


Taxonomy

Topics: Multi-Agent Systems and Negotiation · Natural Language Processing Techniques · European and International Law Studies